ABSTRACT
Preface

Recently several reviews of testing practices have appeared (e.g.,
Marjorie C. Kirkland, "The Effects of Tests on Students and Schools," Review
of Educational Research, 1971, 41, 303-350; Richard C. Anderson, "How to
Construct Achievement Tests to Assess Comprehension," Review of Educational
Research, 1972, 42, 145-170; Barton B. Proger and Lester Mann,
"Criterion-Referenced Measurement: The World of Gray versus Black and White,"
Journal of Learning Disabilities, 1973, 6, 72-84). The reviews have emphasized
either standardized tests or criterion-referenced measurement. Such topics are
receiving the greatest amount of attention from testing experts at present.
However, before the advent of tests used in either a norm-referenced
measurement (NRM) or a criterion-referenced measurement (CRM) manner, teachers
were forced to construct their own informal devices to assess progress. The
reviewers feel that informal, teacher-made tests do not legitimately fall into
either the NRM or CRM categories but rather form a third category of their own.
It is unfortunate that reviewers of educational research have largely neglected
the vast literature on informal, teacher-made tests. At the very least, these
studies are of interest from an historical perspective, in that the seeds for
many of the ideas behind NRM and CRM were first sown in the informal teacher
test domain. This review covers the time period from 1913 to 1968 and thus
includes the bulk of exposition on informal, teacher-made tests, since interest
in the NRM and CRM movements superseded the former type of tests in the late
1940's.

This review is limited to only those articles of either experimental
nature or of philosophical/theoretical nature that relate to the instructional
benefits of tests. (The introductory chapter fully explains the premises behind
the reviewers' perspective on the teaching values of informal, teacher-made
tests.) Needless to say, a great deal has been written about the need to check
student progress by means of teacher-made tests, but most of this literature is
based only on personal biases of the writers and not on evidence. Thus, the
reviewers have set a minimal criterion: a study, to be included, must present
empirical evidence to support the assertion that teacher-made tests are
beneficial to the children, or, in lieu of such evidence, must at least contain
rational psychological learning theory. Twelve topics eventually delineated
themselves:
(1) frequency of testing (34 references); (2) test grading (7 references);
(3) test correction modes (11 references); (4) test result feedback (22 references);
(5) pretesting (36 references); (6) retesting (9 references); (7) test
expectation (7 references); (8) test exemption (13 references); (9) student
preparation modes (16 references); (10) student attitudes toward tests (6
references); (11) test type (5 references); and (12) "test-like events"
(19 references). In total 185 references were summarized.

This project was completed in connection with several testing research
studies carried out by the reviewers and their colleagues from 1967 to the
present. The reviewers hope this document will prove useful to others in
understanding a somewhat different component of measurement heritage than
is commonly recognized in CRM and NRM.
INTRODUCTION



In no case will the review concern itself with standardized testing;
rather, only informal, teacher-generated testing activities are
dealt with. This restriction is justifiable. Several
studies have dealt with standardized tests, especially in the contexts
of pretesting, retesting, coaching, and so on, with the same instrument,
but perhaps with different parallel forms. In other words, these
investigators of standardized achievement tests attempted to see what
learning takes place with standardized tests themselves (as compared
to the learning that takes place in the usual nontesting, lecture
aspect of instruction); some would label such standardized test
learning benefits under the somewhat undesirable-sounding names of
"practice effects", "coaching effects", and so on. Such psychological
effects involved in standardized testing are important. However, in
the real school situation, standardized achievement tests are given
rather infrequently during the school year to any one student. Hence,
the practical use to which such experimental conclusions could be
put is quite limited indeed.

On the other hand, the informal achievement tests given by teachers
throughout the course of instruction constitute a major part of the
curriculum. If learning effects (not just the usual evaluative functions)
above and beyond the in-class, nontesting, instructional process
can be produced from the taking of informal achievement tests
themselves, or from using informal achievement tests in a specific
way, then such knowledge would be highly practical for the realistic
classroom setting. Hence, this review will direct itself to identifying
the various ways in which informal achievement tests can aid
instruction. The review is therefore making a unique contribution
to testing literature; most testing reviews have considered only the
usually cited function of tests: to evaluate and rank the student's
achievement. Other testing reviews have dealt with the technical
issues of test construction: reliability, validity, item difficulty,
item discrimination power, and so on. On the other hand, this review
neglects the already well-documented, usually-discussed topics
of testing and concentrates only on how a student can actually learn
from the very taking itself of tests.

There are several ways in which informal achievement tests can
be used to yield learning benefits above and beyond the usual
instructional, nontesting part of the classroom procedure. However, running
throughout all of the various methods of informal test use is the common
thread of the emotion-producing situation of being under the threat
of a test. As the reader will see, apparently the threat of a test
is psychologically effective enough (albeit in many different ways)
to force the test taker to be more careful of the way in which he
processes the test information as compared to being in a mere nontesting,
practice situation. (Analogies can be made with the results of the
voluminous research in programed instruction, although such research
has been treated adequately elsewhere. Hence, this review will not
concern itself with programed instruction.) In effect, the student
under the testing condition perhaps is being forced to concentrate
on the material presented in the test more so than he would under just
a mere practice condition. The clearest example is the essay test
as used in high schools and colleges. Here the student is asked to
synthesize information he has learned in class in ways somewhat
different from those in which the material was originally presented in
class. It is reasoned that if the content of the test items is presented
in a slightly different order or context from the original
presentation during the nontesting, instructional part of the class
period, then the student is forced to think about the subject matter
in a more meaningful fashion than mere rote recall. In effect,
the content is structured more meaningfully in the student's mind.

The same features of enforced activity with the subject matter
of the test questions and the potential structuring effects in the
student's mind of such subject matter can be found in all types of
tests in all subject areas, not just the essay test that is used in
predominantly verbal subjects as compared to science and mathematics.
To identify just what it is about a test (the emotional effects, the
structuring effects, the practice effects, and so on) that causes
increased learning is still a moot point; adequate measuring procedures
have not yet been devised, especially for the physiological
aspects of test behavior. The reviewer is aware of no studies that
have investigated what happens to blood pressure, pulse, brain wave
patterns, and so on, in realistic testing situations. It is true that
such physiological measures have been taken in unrealistic laboratory
situations with respect to rather contrived and often trivial tasks,
but these studies do not concern the topic at hand.



Nonetheless, certain aspects of informal, realistic achievement
tests as a learning device have been identified and manipulated in
an effort to yield learning benefits above and beyond the usual
nontesting, instructional part of the procedure: (1) frequency of testing,
(2) test grades, (3) test correction, (4) test result feedback,
(5) pretesting, (6) retesting, (7) test expectation, (8) test exemption,
(9) student preparation for tests, (10) student attitudes toward tests,
(11) test type, and (12) "test-like events". Before reviewing the
corresponding references for each of the twelve topics, a brief description
of each area will be given in the context in which it will be used
throughout the review.

The first topic, frequency of testing, can be found to aid
learning. In this review, frequency of testing is defined as how
often the teacher gives an informal achievement test in the course
of instruction. When one controls all other pertinent variables,
he can easily see how frequency of testing might at least create the
potential for increased learning beyond the usual in-class instructional
procedure. First, the students, by the very enforced activity
of going through the material on the test, are getting additional
practice with the subject matter. Second, the threat of frequent
tests might motivate the students to prepare their lessons better
outside of class. Third, the students probably gain insight from
the tests as to which topics in the subject matter are most important
and therefore should be mastered.

Test grades, the second topic in this review, will involve only
the use of so-called extrinsic motivation in the form of grades received
on informal achievement tests. (Although much research and
exposition already exists on systems of report card grading, such
motivation is outside the periphery of this paper.) If students
know what grade they received on an informal achievement test, this
situation might be expected to motivate them to greater accomplishment
in later performance. Any rivalry that might build up between
students in a class could also have a beneficial motivating effect.

The third topic of this review, test correction, concerns whether
or not the students correct their informal achievement tests in class,
as well as whether or not the teacher provides comments about the
mistakes. It might be expected that, above and beyond the learning
that takes place during the nontesting instructional process itself,
a student can learn additional information from the way in which
mistakes are corrected. First, if the tests are randomly handed
back to the students the next day while the test format and
instructional process content are still clear in the students' minds,
it would logically be expected that additional learning will take
place when the student sees the errors of others, asks himself why
a problem is wrong, and tries to relate this to what he did on his
test. Second, regardless of this first consideration, after correction
of the tests (either by the teacher or by the students), if the
teacher then writes comments on each paper pointing out a student's
difficulties, or praising him, additional learning would again
logically be expected to occur. Under different conditions, such
additional learning might not be expected to occur.
Test result feedback, the fourth topic of this review, concerns
whether or not the teacher gives the corrected tests back to the students
as well as whether or not the teacher discusses the errors in
class. This is perhaps one of the most fertile areas from which test
learning benefits can be derived. If the student receives his corrected
informal achievement test back as soon as possible, it might logically
be expected that he will still be interested enough to examine whatever
errors he had on the test. The student might analyze just why
he made such errors and how he might rectify them. Further, if the
teacher also discusses the general classes of errors made in the test,
the students would logically be expected to benefit from such a
discussion.

The fifth topic of this review, pretesting, deals with the effect
of giving a pretest over the unit of instruction to be studied before
such study is actually begun. This is a particularly interesting
topic, since it involves not only psychological problems but also
methodological ones. Psychologically, if students are given
a pretest over the unit of instruction to be studied, then one might
expect the subject matter of the future instructional unit to be
structured to some extent in their minds; the students know what to
look for in their ensuing study by the very nature of the questions
asked on the pretest itself. Methodologically, several investigators
have considered such learning benefits in a negative light and called
them "test sensitization"; these investigators are interested mainly
in the "practice effects" that were mentioned earlier.

Retesting, the sixth topic of this review, will concern itself
with the phenomena of recall and reminiscence in connection with
realistic, meaningful classroom material. If students are retested
(as compared to mere once-and-done posttesting) on subject matter,
they might be expected to retain it longer than those who are not
retested at periodic intervals. It should be noted that retesting
deals with using the same instrument (or an equivalent form) over
and over with respect to the same specific subject material, while
the earlier-mentioned frequency of testing concerns itself with
using different instruments covering different specific subject
matter in a high-frequency schedule of testing. Practice effects
and structuring effects are probably pertinent issues in retesting.

The seventh topic of this review, test expectation, deals with 
whether or not the students have been warned of an approaching test. 
If one is warned about a future test, he might be expected to study 
more than he ordinarily would outside of class in preparation for 
the test. On the other hand, certain students who are affected 
adversely by the very concept of "test" might do better if they are 
given the test in an unannounced fashion, relying on their usual,
noncramming study habits; this might logically be expected to be 
true in the case of poor achievers. 

Test exemption, the eighth topic of this review, concerns itself
with both exemption from testing and exemption by testing. Both have
motivational properties and can be expected to increase learning above
and beyond the usual level of direct, in-class learning. If a student
knows he can avoid certain tests by demonstrating a certain level of
competence in his daily practice work, or if he knows he can avoid
repetitious work by taking an examination on it, then he might be
expected to exert greater effort during the usual, in-class instructional
process.

The ninth topic of this review, student preparation for tests, 
deals mainly with the type of test the student expects. If he anti- 
cipates a test that emphasizes very specific details, he might be 
expected to study in a different manner than if he expects a test 
dealing with broad generalities. 

Student attitudes toward informal achievement tests, the tenth
topic of the review, concerns any systematic survey of students'
preferences for various informal testing procedures. This is
self-explanatory.

The eleventh topic of this review, test type, deals with attempts
to determine whether or not different test types (multiple-choice,
completion, true-false, essay, and so on) used in connection
with the same specific subject matter will yield differential
learning benefits as measured by followup uniform testing procedures.
This can be a very significant topic for the actual classroom situation.
For example, if a student can be forced to synthesize and process
subject matter more effectively on one type of test as compared
to other types, then the teacher would do well to use such a test
type frequently. However, it must be noted here that the reviewer
is interested only in learning benefits that accrue to the student;
the advantages and disadvantages of one test type versus another
with respect to technical test features (reliability, validity,
difficulty, and so on), administrative efficiency, and so on have been
adequately covered elsewhere by specialists in tests and measurements.

The twelfth and final topic of this review, "test-like events",
actually forms a convenient bridge to the discussion on implications
for further research. In fact, the topic of "test-like events" has
been the subject of intensive, recent research by only a few experts
in the field of learning theory. The phrase "test-like events"
was coined by Ernst Z. Rothkopf of Bell Telephone Laboratories, Inc.,
to cover learning situations using written, highly verbal, and
nonprogramed material (that is, the most commonly used expository
passages used in such courses as English, history, geography, and
so on) where the student is evaluated frequently by means of study
questions in a "test-like" (that is, evaluative) manner but yet not
in a true testing situation. Further, Rothkopf has coined the term
"mathemagenic behavior" to cover all the emotional, physiological,
and cognitive activities the student engages in as he learns the
written material via the study questions. The techniques of
investigation used by Rothkopf form a unique methodology, never available
before, for investigating the other eleven topics mentioned above.
Thus, the fine points of the testing situation (structuring
effects, practice effects, and so on) that produce additional
learning benefits can at last be investigated to a depth never
achieved before. One might conclude that any type of questioning
activity could be labeled "test-like"; however, this review will
consider the effects of study questions only as they relate to
written material such as textbook passages (and the extension of such
written passages into the real-life, written, test situations). Other
questioning activities, such as oral questioning and recitation in
class, homework for homework's sake outside of class, and so on, will
not be considered as truly "test-like" in nature. Only those
nontesting, written, questioning activities in reference to corresponding
written passages are considered to be sufficiently "test-like"
in nature to warrant inclusion in this review of test learning
benefits.

The preceding discussion completes a brief overview and informal
definition of each of the twelve topics to be taken up in this review.
However, before attempting a review of the first topic, frequency of
testing, it will aid the reader if he first considers the general,
nonresearch references that have suggested the additional learning
benefits above and beyond the usual nontesting aspects of the
instructional process that can arise from using informal achievement tests
in specific ways.

INTRODUCTORY REVIEW OF THEORETICAL
REFERENCES ON INFORMAL ACHIEVEMENT TEST LEARNING
BENEFITS

McKeachie (1963, P- '15^) says: 

While we usually think of testing procedures in terms 
of their validity as measures of student achievement, 
their function as instruments for promoting learning 
may be even more important. After dismal recitals 
of nonsignificant differences between different 
teaching methods, it is refreshing to find positive 
results from variations in testing procedures.



Anderson (1960, p. 50) provides a well-stated summary of the
whole problem:

But to say that teachers don't sometimes begrudge
the time taken for [informal achievement] testing
and that most students face 'test day' with real
enthusiasm is going too far in the other direction.
Yet this is just what we may be able to say
soon if tests can be utilized to support the
process students and teachers are most concerned
about--if tests can be used to teach students
something. Furthermore, it may be the case that
eventually tests will be as useful for teaching
as for measuring. [underlining inserted by
reviewer]

In support of the experiment presently being conducted by the
reviewer, Gardner (1953, p. 87) says, ". . . more research should be
done regarding practical problems encountered by teachers in the
classroom (and by students as well) in their use of both standardized
and informal tests." [underlining inserted by reviewer]

Koester (1957) claims that the evaluative function of informal 
achievement tests is overemphasized in relation to their instructional 
potential. Making planned use of informal achievement tests in high 
school English classes, asserts Kimmel (1923), has yielded greater 
than usual learning benefits, although no actual experiment was 
carried out. Obourn (1932) cites the opinions of several physics 
teachers in high school who have found that informal achievement 
tests can be used as teaching devices as well as evaluative
instruments.



Ruch (1929, p. 1^5) urges that "we must abandon the thoroughly 
untenable position that time spent in testing is time wasted in 
teaching. Teaching and testing are aspects of the same process." 
Many other early writers and theorists have written about the
learning benefits that can arise from informal achievement tests:
Butler (1922), Elston (1923), Lockhart (1928), Woody (1929), Fenton
(1929), Henricksen (1930), and Symonds (1933). However, none of the
references in this section of the review have supplied experimental
evidence to support their beliefs.

This completes the introductory review of nonresearch references
which have referred to the instructional values of informal achievement
tests only in a general way. Nonresearch references that try
to hypothesize precisely how such additional learning benefits arise
will be treated in the appropriate section in the twelve specific
reviews that follow. The first topic is frequency of testing, which
is of major importance to the reviewer in his dissertation.

REVIEW TOPIC ONE: FREQUENCY OF INFORMAL
ACHIEVEMENT TESTING AS RELATED TO TEST LEARNING BENEFITS



Nonexperimental References: Wrightstone (1963, pp. 50-51) gives
those interested in frequency of informal achievement testing some
precise guidelines but fails to account for the many contradictory
findings:

Some persons have assumed that more frequent tests
will increase the motivation and effort of the student
to achieve immediate educational goals. Carried
to a ridiculous conclusion, this might mean one test
per teaching period. When tests are administered
too frequently, their motivational value is reduced.
In a variety of fields at the college level, studies
show that when weekly tests are given, discussed,
and corrected, the lower-ability students achieve
more on a final examination of similar questions
than with less frequent examinations. The more-able
students may be retarded because of too frequent
testing. The less able profit mainly from direction
of their learning to specifics and to practice
in selecting the correct responses. The more-able
are not aided by frequent--weekly or daily--tests.
[underlining inserted by reviewer]

Unfortunately, Wrightstone omitted all negative results in the
list of experiments he cites to support his conclusions; further, the
"positive" results of the experiments he does cite are often confounded
by other uncontrolled variables. Moreover, one still has no
concrete evidence as to what occurs in the elementary school where
the ideas toward testing in general are being molded in the students'
minds; the investigators and theorists continuously emphasize college
studies but only rarely touch on the more crucial, formative years:
kindergarten through grade twelve.

Gardner (1953, p. 87) displays a common misconception among
educational theorists with respect to frequent testing. In relation
to one of the experimental studies in this field, he says that the
investigators ". . . have again demonstrated the motivating effect
of frequent testing." In the first place, not many investigators
have "demonstrated" such an effect. To the contrary, many conflicting
reports are available; positive results have been the
exception, not the rule. It appears that any results that are
obtained from frequent informal achievement testing must be qualified
with respect to control variables such as grade level, previous
achievement, sex, and so on. The practical implications of this
misconception are important; no doubt many teachers are presently
laboring under the belief that frequent testing will drive their
students on to higher achievement. However, no safe conclusions can
be drawn on this issue.

Many early theorists voiced their support of frequent informal
achievement tests. Ruch (1929) says infrequent tests of an extensive
nature should be laid aside in favor of shorter, more frequent testing.
Pearson (1929) supports frequent informal achievement testing
in his city school system. Odell (1928) thinks pupil achievement
will increase under frequent informal testing but gives no evidence.
Further, he suggests that both slow and rapid learners will benefit
from a program of frequent testing. Finally, he says that such
informal achievement tests should be given often only if they are
relatively short. Opdyke (192?) also voiced the latter idea.
Parker (1920) says frequent testing is needed to make students
prepare their lessons adequately. Ragusa (1930) offers the same
opinion and in addition claims that short objective tests should be
given two or three times a week.

Experimental Findings: In this section all of the experiments
that the reviewer was able to find in his search of the literature
are presented in connection with frequency of testing. The review
of experiments will be conducted in three parts: (1) elementary
school, (2) junior and senior high school, and (3) post-high school.
The experiments will be taken in chronological order.

Each experiment will be described in as much detail as is
possible and practical. Particular attention will be given to flaws
in design and analysis. However, before getting into specific
criticisms of any one experiment, a general distinction will be made by
the reviewer between two types of allegedly "loose" research. The
first type will be termed "broad curricular research" by this reviewer
for expediency's sake later in this paper. "Broad curricular research"
is defined here as an experimental comparison between two curricular
programs of instruction. For example, a school district might be
interested in comparing the effectiveness of a new laboratory-discovery
approach of teaching junior high school science against the traditional
lecture-textbook method. No matter how much effort is taken to control
extraneous factors, the most that one will be able to conclude from
his results is that the new program taken as a whole is or is not more
effective from the cumulative achievement standpoint than the
traditional program; one will never be able to isolate just what aspect
of the new program was or was not a causative factor in the results.
In effect, control in the "basic research" sense is lacking. Many
potential causative factors are confounded with each other. The
above definition, then, is what the reviewer will henceforth mean
by "broad curricular research." However, it must not be inferred
by the reader that the reviewer looks down upon the above type of
research; indeed, such research is a necessary ingredient of
curricular progress.

The second type of "loose" research is the "confounded,
noncurricular experiment." In the above definition of "broad curricular
research" the confounding of various component factors is
unavoidable; these factors are inherent to the curricular program
and hence cannot be "controlled" (or isolated) without destroying
the integrity of the method. However, the reviewer considers a
topic such as frequency of testing to be manipulable independently
of the particular curricular setup being used. In other words, not
considering administrative difficulties, a topic such as frequency
of testing should be highly amenable to control in the research
sense. A large number of experiments, however, have inadvertently
confounded the factor of frequency of testing by the manipulation
of other variables at the same time as frequency of testing. Thus,
in this review, the "confounded, noncurricular experiment" will
be considered as a definite design error that could have been avoided
by thoughtful planning. (The reader should also be aware, however,
that a confounded design can be a deliberately planned advantage--
rather than an inadvertent error--if the investigator is interested
in highlighting certain interactions or main effects. However, all
confounded noncurricular experiments in the following sections of
this review were design blunders and not sophisticated analytical
refinements.)

The reviewer is now ready to proceed with the discussion of 
experimental studies under the topic of frequency of testing. The 
first set of experiments to be considered is that of the elementary
school.


(a) Elementary School: As stated previously, the
reviewer considers testing procedures in the elementary school to
be crucial to children's attitudes, which are then in their formative
stage. Out of a total of 27 experiments done in the area of frequency
of informal achievement testing, only one study was done at
the elementary school level.

Mann, Taylor, Proger, and Morrell (to be published) dealt with
daily testing in third-grade arithmetic. The study material was
multiplication. This pilot study was conducted in Spring, 1967, to
determine whether or not being under the psychological threat of
frequent testing in a natural learning situation during a unit of
instruction will result in beneficial content structuring effects,
increased attention to material, and so on, in terms of immediate
and delayed retention.

Four randomized groups of about twenty students each were
formed: BE, GE, BC, and GC (E, C, B, and G represent "experimental",
"control", "boys", and "girls", respectively). To control for
differing teacher effectiveness and possible interaction effects
of teacher personality with students' personalities, the four
teachers were randomly rotated throughout the four groups from
day to day. Within each group, low and high previous achievement
categories were identified (ex post facto) on the basis of the
final arithmetic mark in second grade. The two E sections received
practice worksheets which were to be counted in with their total
mark and accordingly were letter graded, while C received the
identical worksheets and were told they would count only as practice.
All four groups received the worksheets back in class the
next day and were told to locate and correct the errors they had
made. During most of the rest of the class period, the new worksheets
were used. The experiment lasted twenty class days.

The reviewer is presently analyzing the results of this 
experiment. The design is totally confounded across blocks (groups) with 
respect to methods and sex. However, within each block, an 
unconfounded comparison can be made between high- and low-previous 
achievement subgroups. Although the confounding in this design 
was inadvertent, it actually aids the experimenters in studying a 
particular aspect of the problem: the interaction of methods by 
previous achievement. A confounded design in this specific case gives 
a more powerful test of such a measure. The investigators were 
especially interested in testing whether or not high previous 
achievers would do better in E than C (that is, good students 
might like the challenge of a test condition rather than a 
practice condition) and low previous achievers would do better in C 
than E (that is, poor students might feel more secure under the 
nonthreatening practice condition than the experimental one). 

The study by Mann et al. (to be published) was the only one at 
the elementary school level on the topic of frequency of informal 
achievement testing. This is one reason why the reviewer decided 
to do his dissertation experiment on comparing daily testing, 
alternate-day testing, once-a-week testing, and no testing in 
sixth-grade arithmetic. With the Mann et al. (to be published) 
study above, the reviewer's dissertation, and the experiments 
to be described below at the junior and senior high school and 
post-high school levels, developmental implications might arise. 

(b) Junior and Senior High School: Of a total of seven 
experiments at both the junior and senior high school levels, only one study 
touched upon the junior high school level: Maloney and Ruch (1929). 
They compared three methods of teaching grammar in ninth, tenth, and 
eleventh grades in junior and senior high school. Three methods 
groups were formed from a total of 497 students. The first group 
used only the textbook and was given no tests. In the second 
group, ten 25-item tests were used as instructional material in 
place of the textbook; another five 25-item tests were used for 
evaluative purposes. The third group was taught by a combination 
of the textbook and five short tests. 

Although no significance of results was stated, a trend was 
noted with the test group achieving highest and the combination 
method next. The reader should note that, while lacking perfect 
control, the three methods have pedagogical soundness. The 
reviewer considers this experiment to be "broad curricular 
research"; a natural learning situation prevailed. However, one 
will never know just what it was about each method that caused 
the weak but notable trend in results. 

No other studies on frequency of testing occurred at the 
junior high school level (grades seven through nine). Thus, the 
studies that dealt exclusively with the senior high school level 
are considered next. 

Kitch (1932) studied the effect of frequent informal 
achievement tests when used only as practice (not counted in 
with the students' grades). The students were enrolled in 
tenth-grade high school biology. C consisted of the group of 89 
students who took the course the preceding year, while E was the group 
of 88 students currently available. No attempt at matching the two 
groups was made, since the investigator found that the difference 
in average Terman Intelligence Test scores for the two groups 
resulted in an insignificant z value (P = .98); hence, he 
considered the two groups to be initially equal for his purposes. 

Short, frequent practice tests were given to E but 
not to C. E was told that these short tests would not count in 
their grade. Unfortunately, nowhere in his exposition does Kitch 
say just how often such practice tests were given to E; one 
would assume that E was given at least one practice test for 
each unit. The first four units of instruction were covered in 
this manner. For the fifth unit of instruction, no practice tests 
were given to E; in effect, other than carry-over effects, E and 
C were treated alike. During the first four units, the papers 
were redistributed the next day so that no student received 
his own paper; the papers were then corrected by the pupils. 

Both E and C received the same major unit tests. These tests 
formed the main criteria of comparison. "Though there had never 
been any attempt to standardize these tests it was believed that they 
were sufficiently valid and reliable for the purpose of this 
experiment. Observations made during the period in which these tests have 
been used indicate that with comparable groups, comparable scores 
have been made." (Kitch, 1932, p. 39). Using number of errors 
made, the investigator presents simple z ratios for each unit: on 
the first unit, C > E (two-tailed P = .02); on the second unit, 
C > E [P value illegible]; on the third unit, [comparison and 
P value illegible]; on the fourth unit, C > E (two-tailed P = .20); 
and on the fifth unit, C > E [P value illegible]. 

Kitch concludes that such self-scored practice tests are well 
worthwhile in subject areas where many facts are presented; the 
teacher is saved a lot of time from paperwork, while at the same 
time the student is motivated to prepare himself better. The reader 
should note, however, that, as far as motivating students to prepare 
better outside of class is concerned, such would probably not be 
the case in elementary school where most students do not as yet 
take their school work very seriously. 

Connor (1932) investigated the effect of frequent testing in 
high school physics. From a working pool of seven experimental 
classes and ten control classes, he formed four matched groups on 
the bases of mathematical ability (Kilzer-Kirby Inventory) and 
intelligence (Otis Self-Administering Test of Mental Ability). The 
experimental groups made use of the "Instructional Tests in Physics" 
by Glenn and Obourn. Twenty of these tests were given to the 
experimental students during the school year. However, when measured 
on the two posttest criteria (Harvard Elementary Physics Test and the 
1930 Iowa Academic Meet Test in Physics), the control groups, on the 
whole, exceeded the experimental groups. Connor attributes his 
results to the fact that the twenty frequent tests took away 
instructional time from the experimental groups; he does not advocate the 
use of such frequent testing in high school physics. 

Connor's procedure can be termed "broad curricular research"; 
if a program of frequent testing is to be used, instructional time 
apparently must decrease in relation to the control group (such 
need not be the case, if administrative difficulties can be 
overcome, as was the case in the reviewer's dissertation experiment). 
An investigation such as Connor's study is useful for comparing 
broad curricular programs of frequent testing. However, "broad 
curricular research" will never answer the more psychological, 
noncurricular problem of whether or not the test itself, by virtue 
of its threatening, negative aspects or positive, motivating 
effects, can actually teach the child something just through 
increased attention to the matter at hand. This type of 
reasoning refers only to what occurs during the taking of the test in 
class, not to the motivating of the student to prepare for the 
test outside class. Both issues are important; however, the former, 
being more "basic" in nature (psychologically oriented) and harder 
to control in research, has usually been ignored. Again, one should 
be able to see the distinction between "basic" research and 
"curricular" research. 






McClymond (1932) conducted a study very similar to that of 
Connor (1932): using or not using the Glenn-Obourn practice tests in 
high school physics. However, McClymond did get significant 
achievement results in favor of the students who had used the Glenn-Obourn 
tests. But the investigator points out that the testing time was 
subtracted from the laboratory periods and not from the lecture 
periods, as in Connor's study; McClymond's procedure thus resulted 
in greater control than in Connor's experiment. 

Weissman (1934) investigated daily testing in high school 
physics. E (181 students) received daily tests for about nine weeks; 
C (180 students) did not receive any daily tests. The groups were 
matched as far as possible on chronological age, cumulative 
mathematics average, cumulative physics average, and pretest score. It 
was found that (E - C) = 6.69 ± 0.89 (P < .001). 
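
As a rough check on a result reported in the form "difference ± error", the difference can be converted into a z ratio and a two-tailed normal probability. The sketch below is only illustrative: the source does not say whether 0.89 is a standard error or a probable error, so both readings are shown as assumptions, and the function names are hypothetical.

```python
import math

def z_ratio(difference, standard_error):
    """Return the z ratio: a difference divided by its standard error."""
    return difference / standard_error

def two_tailed_p(z):
    """Two-tailed normal probability of a z at least this large in size."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Assumption 1: 0.89 is the standard error of the difference.
z_if_se = z_ratio(6.69, 0.89)

# Assumption 2: 0.89 is a probable error (PE = 0.6745 * SE),
# so the standard error is 0.89 / 0.6745.
z_if_pe = z_ratio(6.69, 0.89 / 0.6745)

print(z_if_se, two_tailed_p(z_if_se))
print(z_if_pe, two_tailed_p(z_if_pe))
```

Under either reading the two-tailed probability falls below .001, which agrees with the significance level stated for this study.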

Curo (1963) studied the effect of daily quizzes on 
eleventh-grade American History classes. Three types of schools were used: 
large metropolitan, medium suburban, and medium rural-consolidated. 
Ten intact classes were divided into five experimental groups and 
five control groups. "To minimize the effect of varying teacher 
competence, each teacher instructed one or more pairs of 
control-experimental classes." (Curo, 1963, p. 70). The experiment lasted 
six weeks. 

A cumulative pretest was given to all E and C classes; the 
pretest was the First Semester American History Test, State High 
School Testing Service for Indiana. Otis Mental Measurement scores 
were also available. The control classes received two major unit tests 
but no daily quizzes during the six-week study; the experimental 
classes received the daily quizzes but not the major unit tests. 
(Because of this inequity in the amount and kind of testing, the 
reviewer classifies Curo's study as "broad curricular research".) 
For the E classes, the daily quiz was given during the first five 
to ten minutes of the class period on a previous study assignment. 
The investigator devised his own 135-item posttest. Also, all 
classes were given the original pretest again as a second immediate 
posttest. 

In the analysis of variance for the original pretest, the two 
classes of the suburban school were so discrepant in variance 
(97.63 versus 85.78, although no test of significance is provided) 
that the whole school was deleted from all subsequent analyses. 
Then, eight intact classes were left for a total of 184 students. 
To achieve equal cell frequencies of 46 (sum of all students within 
one school for one method), random selection was used. On the 
original pretest, no significant differences were found for methods, 
schools, or methods by schools (all F's < 0.50). When the same 
pretest was used as a posttest, schools came closest to significance 
(P = .30), but again methods and methods by schools were definitely 
insignificant (both F's < 1.00). On Curo's own posttest, schools 
were significant (.025 < P < .05), methods were next (.10 < P < .25), 
and methods by schools last (P = .30). 
Curo concludes that using daily informal achievement tests to 
increase the learning of factual material is not effective at all. 
Perhaps Curo's analyses would have yielded more accurate 
conclusions if he had used analysis of covariance instead of his three 
separate analyses of ordinary variance. Further, his inspection 
method of deciding whether or not to delete a school or pair of 
classes on the basis of apparent nonhomogeneity of variance is 
open to criticism; rather, a transformation of scores is suggested. 
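
To make the suggestion of analysis of covariance concrete, the following is a minimal sketch of a one-way analysis of covariance with a single covariate (pretest) and a single criterion (posttest). It is not a reconstruction of any analysis in Curo's study: the scores are hypothetical, the function names are invented for illustration, and the usual assumption of equal within-group regression slopes is taken for granted.

```python
def sums(xs, ys):
    """Return (Sxx, Sxy, Syy): sums of squares and products about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxx, sxy, syy

def ancova_f(groups):
    """One-way ANCOVA F ratio with one covariate.

    groups: list of (pretest_scores, posttest_scores), one pair per method.
    Returns (F, df_between, df_within).
    """
    all_x = [x for xs, _ in groups for x in xs]
    all_y = [y for _, ys in groups for y in ys]
    sxx_t, sxy_t, syy_t = sums(all_x, all_y)

    # Pool the within-group sums of squares and products.
    sxx_w = sxy_w = syy_w = 0.0
    for xs, ys in groups:
        sxx, sxy, syy = sums(xs, ys)
        sxx_w += sxx
        sxy_w += sxy
        syy_w += syy

    # Posttest sums of squares adjusted for regression on the pretest.
    adj_total = syy_t - sxy_t ** 2 / sxx_t
    adj_within = syy_w - sxy_w ** 2 / sxx_w
    adj_between = adj_total - adj_within

    df_between = len(groups) - 1
    df_within = len(all_x) - len(groups) - 1  # one df lost to the covariate
    f = (adj_between / df_between) / (adj_within / df_within)
    return f, df_between, df_within

# Hypothetical pretest and posttest scores for two methods.
control = ([10, 12, 14, 16, 18], [20, 25, 27, 33, 35])
experimental = ([11, 13, 15, 17, 19], [26, 28, 34, 35, 42])

f, df_b, df_w = ancova_f([experimental, control])
print(f, df_b, df_w)
```

The single adjusted F ratio replaces separate analyses of pretest and posttest, because initial differences between groups are removed through the regression of posttest on pretest.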

The last study at the senior high school level on frequency 
of testing is Pikunas and Mazzota (1965). They studied the effects 
of weekly testing in twelfth-grade chemistry in a large city 
technical high school. A total of 128 students were taken from 
four intact classes. Two intact classes were assigned at random 
to the first treatment group, while the second treatment group 
received the other two intact classes. Each class met for three 
68-minute periods a week. The study lasted twelve weeks. One 
chapter per week was covered. The twelve chapters were divided 
into two independent sets of six each: A and B. 

A crossover design was used. During the first six weeks, neither 
group of classes received weekly tests; the first treatment 
group (two intact classes) had the A chapters, and the second 
treatment group (two intact classes) had the B chapters. The six-week 
examination was from the publisher of the textbook chapters. This 
test was not returned to the students after they had taken it. All 
tests in the study (the six-week criterion and the weekly noncriterion 
tests) were corrected by a teacher not involved in the instruction of the 
experiment. During the second six weeks, both groups of classes 
received a quiz once a week, again taken from the publisher's test 
booklet for his textbook. However, this time the first treatment 
group received B, and the second treatment group received A. The 
teaching time lost because of the weekly tests during the second 
six weeks was compensated for by giving enforced study periods 
during the first six weeks. 

The analysis supplied by the investigators consists only of 
crude percentage statistics pooled across chapter sets and 
treatment groups to obtain E versus C (E = 70. and C = 60.77%). 
No statistical test was made. The investigators caution, "There 
is, of course, a danger of becoming too preoccupied with testing, 
and of allowing this preoccupation to lead to a distorted 
situation where testing is credited with attributes and accomplishments 
which it does not possess." (Pikunas and Mazzota, 1965, p. 375). 
They also claim, "Additional investigations are necessary to find 
out whether this also applies to other subjects than science and 
whether this applies on each level of education." (Pikunas and 
Mazzota, 1965, p. 376). 

In the above experiment, to form the comparison of major 
importance, E versus C, one must admit the flaw of "history" and, 
to a lesser extent, "maturation" (as defined by Campbell and Stanley, 
1963). The reviewer cannot see the logic of this particular 
crossover design. Apparently the use of two different sets of chapters 
at the same time was to counter any exchange of information from one 
group to the other, but this is of no benefit to the E versus C 
comparison in the present design anyhow. Further, with the original 
design, one cannot determine whether the apparently superior 
performance of the E groups of classes (that is, all the classes during 
the second six weeks) was caused by the weekly tests or by the 
perhaps better-activated study habits of the students, having already 
gone through six chapters of the same publisher's format. 

(c) Post-High School Experiments: Nineteen studies fall into 
the post-high school category. Before reviewing the experiments, 
however, the reader should note that any attempt to generalize the 
findings of college studies down to the high school and, especially, 
the elementary school might be on very shaky ground. The college 
milieu is very different from that of the public schools. College 
classes meet only two or three times a week and sometimes only 
once a week; the students are allowed much more responsibility than 
in their pre-college days; they have much more free time to do 
with as they please. In fact, within the public school system, it 
is true that even high school and junior high school are 
psychologically different. Such psychological differences among 
educational levels become of crucial concern in the elementary school 
where the child, still in his formative years, first meets the 
threat of informal achievement tests. It is here that least is 
known about the way informal tests can be used as learning devices, 
and it is here that generalization from the college situation is 
most dangerous. For one thing, elementary students cannot be expected to 
prepare very extensively, if at all, for tests on their own time outside 
class; rather, one must concentrate attention on the problem of what 
effect informal tests have on the child's in-class performance, assuming 
most, if not all, preparation for tests must occur in class itself. 

Deputy (1929) studied frequent testing in second-semester 
freshman philosophy at a state university. The testing material 
consisted of the preceding day's lesson. Each class met two days a 
week. At the start of the semester all students were given the 
Otis Self-Administering Test of Mental Ability: Higher Examination. 
At the time of the experiment the reliability of the Otis test was 
given as .92. Two experiments were performed on the same students 
of three intact sections: one study during the first half of the 
semester and one study during the second half of the semester. 

During the first half of the semester, three different testing 
methods were compared. Section E1 (30 students) received a 
ten-minute quiz every class day. Section E2 (33 students) received a 
twenty-minute quiz once a week. Section E3 (33 students) received 
no quizzes or unit tests. The investigator paid especial attention 
to rationalizing to the classes the differences that existed among 
their procedures. Code numbers were given to each student so that 
only he would be able to recognize his results on the blackboard the 
next day. This first experiment lasted for six weeks until the 
mid-semester test. 
During the first half of the semester, z tests on the Otis test 
were computed for the three pairs of sections [two of the three 
P values are illegible]; for E1 versus E3, two-tailed P = .20. 
Since none of these initial difference 
measures was statistically significant, Deputy goes on to discuss 
differences on the criterion posttest (mid-semester examination). 
On this test significant z test results were found: E1 > E3 
(two-tailed P < .001); E3 > E2 (two-tailed P = .13); and E1 > E2 
(two-tailed P < .001). In summary, E1 significantly outperformed 
both E2 and E3, while E2 apparently was not even as effective as E3. 

During the second half of the semester, one section received a 
ten-minute quiz each class meeting, while the other two sections 
received no quizzes at all. On the final examination (about 100 items), 
insignificant z test results were obtained: E1 versus E3 (two-tailed 
P = .24); E2 > E3 (two-tailed P = .05); and E2 > E1 (two-tailed 
P = .27). 

The analyses of both Deputy experiments can be criticized on 
the grounds that simple z tests were used instead of a combination 
of analysis of variance and multiple comparisons. In fact, the 
pretest Otis scores should have enabled an analysis of covariance. 
Further, the final examination performance of all three groups 
during the second half of the semester is contaminated by unequal 
carryover effects from the first half of the semester. 
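
The objection to running several simple z tests on the same three sections can be illustrated numerically. The sketch below assumes, for simplicity, that the three pairwise tests are independent (they are not strictly so, since they share groups) and uses the Bonferroni adjustment only as one familiar example of a multiple-comparison correction; it is not the procedure named by the reviewer, and the function names are hypothetical.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false rejection across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / n_tests

# Three sections give three pairwise comparisons: E1-E2, E1-E3, E2-E3.
n_pairs = 3
overall = familywise_error(0.05, n_pairs)
per_test = bonferroni_alpha(0.05, n_pairs)

print(overall)   # chance of at least one spurious "significant" result
print(per_test)  # stricter level to apply to each pairwise test
```

With three tests each run at the .05 level, the chance of at least one spurious finding is roughly .14 under these assumptions, which is why an overall analysis of variance followed by multiple comparisons is preferred.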

Serenius (1930) dealt with frequency of testing in college 
history. The looseness of the design would qualify it as "broad 
curricular research." Two classes (totaling [number illegible] 
students) each 
had the same lecture class twice a week. During the third meeting 
of each week, E received a test, while C had informal discussion. 
At the end of the semester no significant differences were found. 

Turney (1931) investigated frequent testing in an educational 
psychology course with college juniors and seniors. Two intact 
groups were used. The final examination criterion posttest was 
divided into two 175-point equivalent forms A and B. Both forms 
were a combination of true-false, multiple-choice, and completion 
items. Form A was given as a pretest to both groups at the start 
of the study. The group scoring lowest on this pretest became 
E (N = 40 for E and N = 28 for C). The final examination consisted of 
both forms A and B. A different mid-semester examination (neither 
A nor B) was given to both groups. E received eleven short quizzes 
throughout the semester about once a week; C had only one quiz. 
The quiz results were given to the E students at the next class 
meeting but the quizzes themselves were not returned. Both E and 
C were taught by the same instructor. 

Using a third intact section of students comparable to those 
in the study, Turney claims that no practice effects could be found between 
forms A and B and that the difficulties of the two forms were equal. 
Because of the results obtained from the third group, which was not directly 
involved in this study, Turney computed gain scores by subtracting 
two times the pretest form A from the final test (both A and B). 
Because of the way he chose group E, on the pretest the z test found 
C > E (P = .0001). On the final examination, however, E and C did 
not differ (two-tailed P = .99). For gain scores themselves, E > C 
(two-tailed P = .005). 
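
The gain score described above can be written out as a small sketch. The doubling of the pretest follows from the description: the pretest was the 175-point form A alone, while the final examination was forms A and B together (350 points), so the pretest is doubled to put it on the same scale. The student scores below are hypothetical, and the function name is invented for illustration.

```python
FORM_POINTS = 175  # each equivalent form (A or B) was worth 175 points

def turney_gain(pretest_form_a, final_forms_a_and_b):
    """Gain = final score (forms A + B) minus twice the pretest (form A)."""
    return final_forms_a_and_b - 2 * pretest_form_a

# Hypothetical student: 60 of 175 on the pretest, 250 of 350 on the final.
gain = turney_gain(60, 250)
print(gain)
```

A student who merely repeated his pretest level on both forms of the final would show a gain of zero under this scheme.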

Smeltzer (1931) studied both frequent testing and exemption from 
certain class work (by test performance) in a course in undergraduate 
educational psychology. The course was given repetitively in a 
university on the three-term schedule. Throughout each of the three 
terms an attempt was made to standardize presentation of material as 
much as possible by making both E and C classes follow the same course 
outline for the seven major topics of the course; a calendar of topic 
progression was also used. In each term, the procedure for the C 
intact classes was the same. C classes received two fifty-minute 
essay tests during the term. In each term, all E and C classes were 
given a pretest and an objective final examination. Any test given 
to either the E or C classes was scored by the teachers and given 
back to the students at a later class meeting; the tests were 
corrected so that not only were incorrect answers marked wrong but 
also the correct answers were provided. In both E and C classes the 
discussion-recitation method was used. The pretest and final 
examination were parallel forms of the same objective examination. 
Throughout the three terms most of the total of 523 students were 
freshmen and sophomores. 

During the autumn term six intact classes (three E and three C) 
were used. In the E classes, a major objective examination was given 
every other Thursday outside of regular class in the late afternoon. 
Then, on the immediately following Friday morning, the corrected tests 
along with an item analysis were ready for the E instructors to use 
during the regular class meeting that day. During the first twenty 
minutes of this Friday morning class, the E students received their 
corrected tests and a discussion of the test questions was given. 
"At the end of the [E] class period on Friday each instructor would 
read the names of approximately one-fourth to one-third of his 
students who scored highest on the examination and who would 
therefore be excused from class the following Monday and Tuesday." 
(Smeltzer, 1931, p. 31). The students who were required to 
return on the next Monday and Tuesday underwent an intensive 
discussion of the subject matter on the E test of the preceding 
Thursday; these low-scoring students then were given a 
twenty-minute retest on this subject matter. On Wednesday, both low- and 
high-scoring E students resumed their usual classroom procedures. 

The analysis of percentiles on the final examination during 
each term was carried out two times: (1) without matching and (2) with 
matching. (However, the reviewer sees no advantage to examining 
unmatched results when one has the matched results also available.) 
"A very rigid pairing involving three criteria was used. The 
criteria were on the basis of (1) sex; (2) intelligence test 
percentile to the extent of ± 4 points; (3) pre-test score to the extent 
of ± 5 points." (Smeltzer, 1931, p. 90). Five percentiles were 
compared on the final examination during each of the three terms: 
90%-ile, Q3, Median, Q1, and 10%-ile. Simple z tests were then 
computed. During the autumn term only the 10%-ile (E - C) difference 
had a reasonably large z value: two-tailed P = .27 (35 matched pairs). 
However, it should be noted that, with the exception of the 90%-ile 
level, all comparisons were in the direction of E > C. One should 
note that the autumn experiment was of a confounded design (actually, 
a confounded, confounded design, if such can be the case; that is, 
confounding occurred not only across methods blocks but also within 
the methods blocks for different frequencies): two levels of frequent 
testing for E, by exemption. 
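
The three pairing criteria quoted above (same sex, intelligence percentile within a few points, pretest score within a few points) can be sketched as a simple matching routine. Everything here is an assumption made for illustration: the source does not describe how candidate pairs were actually selected, so a greedy first-fit search is used; the tolerances of 4 and 5 points follow the quotation as read here; and the student records and field names are hypothetical.

```python
def is_match(e, c, iq_tol=4, pre_tol=5):
    """True if two students satisfy all three pairing criteria."""
    return (e["sex"] == c["sex"]
            and abs(e["iq_pct"] - c["iq_pct"]) <= iq_tol
            and abs(e["pretest"] - c["pretest"]) <= pre_tol)

def match_pairs(experimental, control, iq_tol=4, pre_tol=5):
    """Greedy one-to-one matching: each control student is used at most once."""
    unused = list(control)
    pairs = []
    for e in experimental:
        for c in unused:
            if is_match(e, c, iq_tol, pre_tol):
                pairs.append((e, c))
                unused.remove(c)  # safe: we leave the inner loop at once
                break
    return pairs

# Hypothetical student records.
e_students = [
    {"sex": "F", "iq_pct": 60, "pretest": 40},
    {"sex": "M", "iq_pct": 30, "pretest": 20},
]
c_students = [
    {"sex": "F", "iq_pct": 70, "pretest": 41},  # percentile too far apart
    {"sex": "F", "iq_pct": 63, "pretest": 44},  # acceptable match
    {"sex": "M", "iq_pct": 31, "pretest": 30},  # pretest too far apart
]

pairs = match_pairs(e_students, c_students)
print(len(pairs))
```

Rigid criteria of this kind discard every student without a partner, which is one reason the number of matched pairs in a term can be far smaller than the enrollment.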

During the winter term three intact E classes and three intact 
C classes were again used. However, this time the E classes received 
weekly twenty-minute tests every Thursday instead of the fortnightly 
tests. After the instructors corrected the E tests outside class, 
they were returned to the E students on the following Friday morning 
for about twenty minutes. As in the autumn term, the E students had 
to return the tests to the teacher at the end of this time. The same 
exemption procedure was used for determining which students had to 
return on the next Monday. However, the retest for the nonexempted 
students was given the latter part of Monday's class meeting after 
intensive review of the Thursday quiz's subject matter. Further, 
in contrast to the autumn term, each E student, regardless of being 
exempted or not, was given only numerical grades, not letter grades. 
Also, using the Thursday quiz score, or the average of the quizzes 
of Thursday and Monday, a graph of each student's weekly progress 
was kept in the classroom on a large chart. 

During the winter term, only the Q1 and 10%-ile (E - C) 
comparisons provided z tests that are worth noting: two-tailed P 
[value illegible] and P = .004, respectively. However, at each of 
the other three percentile levels, E > C. Again, one has a 
confounding within confounding design: frequency within frequency 
by exemption by motivation. 

During the spring term, E group still received the same type of 
weekly testing. However, the exemption process was modified so that 
anyone could omit the Monday class, or attend it, as he wished. 
Further, the E students were now given a weekly chart-in-class 
incentive on average letter grade (Thursday alone or Thursday 
averaged with Monday, as the case might be), as compared to the 
winter term's numerical chart value incentive. There were three 
intact E classes and four C classes. The C procedure remained 
unchanged. 

During the spring term, again only the Q1 and 10%-ile 
(E - C) comparisons provided z tests worth noting: P = .06 and 
P = .01, respectively. The same design criticism can be made here 
as with the winter term: confounding within confounding (fre- 
quency within frequency by exemption by motivation). 

Kulp (1933) studied the effect of weekly quizzes on graduate 
students in educational sociology. Two one-hour class meetings per 
week were scheduled. A ten-minute quiz was given each week to all 
32 students until the mid-semester examination. "On the basis of 
the mid-term examination, the class was divided into a 'high' half 
and a 'low' half. . . . Following the mid-term examination, only 
the 'low' half of the class took the ten-minute weekly tests; 
the 'high' half was excused." (Kulp, 1933, p. 158). Using a z ratio 
to compare the difference between "halves" on the mid-semester 
examination, H > L (two-tailed P < .01); however, on the final 
examination, H versus L gave two-tailed P = .36. Because of the 
large loss of significance 
of the difference between H and L during the second half of the 
semester, Kulp concludes that the weekly tests benefited L 
"significantly." 

Keys (1934a) dealt with weekly testing in educational 
psychology during the spring semester. Although 360 students comprised 
the experimental and control sections, matching on the basis of sex 
and a 167-item true-false pretest reduced the total to 143 analyzable 
students in each section. Keys divided the semester into three equal 
parts of four weeks each. Although his main interest was in weekly 
tests versus monthly tests, he confounded this comparison with the 
use of study guide assignments in E. The testing procedure itself 
was carefully controlled: ". . . tests administered to the two 
sections were identical, both in content and total amount, 
differing only in that the experimental group took these in brief 
weekly installments, and the control in the form of long mid-term 
examinations." (Keys, 1934a, p. 428). 

During the first four-week period, E had specific weekly 
assignments and four weekly tests, while C had a general, monthly 
assignment and one monthly examination. During the second 
four-week period, E again had specific weekly assignments and four 
weekly tests, while C also had specific weekly assignments but 
only one monthly examination. During the last four-week period, E 
had only the general monthly assignment and one monthly examination, 
while C had specific weekly assignments and one monthly examination. 

To make comparisons between testing modes for each four-week 
period, Keys added up the four weekly tests of E to pit the sum 
against C's one monthly test. For the first four-week session, 
E versus C gave two-tailed P < .001; for the second four-week 
session, two-tailed P < .001; for the third four-week session, 
two-tailed P = .03; and for the final examination, two-tailed 
P = .13 [the direction of each difference is illegible]. Thus, 
while differences exist between groups 
during the study, the groups are not much different when cumulative 
achievement is measured at the end of the study. 

Considering equality of study-guide assignments in Keys' study, 
only during the second four-week period can a reasonable comparison 
be made between weekly tests and monthly tests. However, such a 
comparison would be contaminated by any carryover effects from the 
first four-week period. In fact, the first four-week period is the 
only one that allows any comparison free of contamination; however, 
no matter what comparison is made at any time throughout the study, 
there will always be confounding present. This is an extremely 
complicated design situation, and Keys was unaware of the analytical 
considerations it raises. 

Eurich, Longstaff, and Wilder (1937) investigated a program of 
weekly tests in an Introductory college psychology course. The course 
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met three times a v;eck with one hour and twenty minutes to each class* 
The experimental group consisted of thirty-eight men and twenty-five 
women drawn from a total of I38 students, while the control group was 
chosen on a matching basis (College Ability Test and course pretest) 
from the previous fall semester's 288 students. Both C and E were 
given the same 364-item course pretest at the start of their
respective semesters. The pretest covered (1) facts and principles,
and (2) attitudes and beliefs. E was marked in terms of the per- 
cent of possible gain left after taking into account the pretest 
score. On the other hand, C was graded the usual way on the basis 
of the mid-semester and final examinations. E was given the
original pretest as its final examination, while C had a composite 
of the first five weekly tests of E as C's mid-semester examination 
and a composite of the last five weekly tests of E as C's final 
examination. Then to get a total score for C that would be com- 
parable to E's final test, C's mid-semester and final examination 
scores were added. 
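
The marking scheme used with E can be made concrete with a small
sketch. It assumes that "percent of possible gain" means the gain from
pretest to final test divided by the room left above the pretest
score; the source does not give the formula, and the function name
and the sample scores below are hypothetical.

```python
def percent_of_possible_gain(pretest, posttest, max_score):
    """Gain from pretest to posttest as a percent of the gain still possible.

    Assumed formula (not stated in the source):
        100 * (posttest - pretest) / (max_score - pretest)
    """
    room = max_score - pretest
    if room <= 0:
        raise ValueError("pretest score leaves no room for gain")
    return 100.0 * (posttest - pretest) / room


# Hypothetical student: 40 of 100 on the pretest, 70 of 100 on the final.
print(percent_of_possible_gain(40, 70, 100))
```

Under this reading, two students with the same raw gain receive
different marks if their pretest scores differ, which is one reason
the grading procedure cannot be separated from testing frequency in
this design.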

The investigators used the odd-even method of determining the
reliability of their tests. "The reliability coefficients thus derived
for the experimental group are: initial test, .95; sum of
weekly tests, .97; final test, .97. The reliability coefficients
for the control group are likewise high: initial test, .94; final
test, .95." (Eurich et al., 1937, p. 336). Using simple z ratios,
the investigators found that for the College Ability Test, E
[illegible] C (two-tailed P [illegible]); for the pretest, E > C
(two-tailed P [illegible]); and for the final test, E > C (two-tailed
P [illegible]). Thus, as far as "facts and principles" are concerned,
Eurich et al. conclude that the combination of weekly examinations
and grading on the basis of relative potential gain was not effective
in increasing achievement. Further, using the extremes of the
distributions (upper and lower sevenths), the same insignificant
results were obtained. When the investigators broke their posttest
up into the ten major topics covered throughout the course, technical
material appeared to benefit more than the more generalized areas.
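
As an illustration of the odd-even method mentioned above, the
following sketch correlates each student's total on the odd-numbered
items with the total on the even-numbered items. It then applies the
Spearman-Brown step-up, which is an assumption here: the source does
not say whether Eurich et al. corrected their coefficients. All
function names and the tiny data set are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2.0 * r_half / (1.0 + r_half)


def odd_even_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per student."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Hypothetical half-test correlation of .90 stepped up to full length.
print(round(spearman_brown(0.90), 3))
```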

The procedure of matching spring semester students with the 
preceding fall semester's students can be criticized on the basis 
of "history": extraneous events that occur during one semester and 
not the other. Further, the psychological atmosphere of the spring 
semester is different from that of the fall semester; the drudgery 
of the beginning school year and the major vacation of Christmas 
affect the fall semester, while the results of the spring semester
are affected by the students' carefree attitudes. Also, frequency
was confounded with grading procedure. The procedure used with E
can be criticized on the basis of pretest sensitization; a parallel
form of the pretest would have been more appropriate. Moreover, E
could be said to have undergone practice effects with respect to the
final examination, since the ten weekly tests were made up directly 
from the items of the lengthy pretest. 

Johnson (1938) found that weekly tests resulted in significantly
higher achievement on an immediate posttest for college
students. However, this difference vanished on a delayed posttest
six weeks later.

Noll (1939) studied giving frequent quizzes in an educational
psychology course for third- and fourth-year students at the university
level. During one year, an intact class was given four quizzes at
approximate intervals of three weeks each (E). During the following
year, another intact class was not given the quizzes (C). Both
classes were compared on the 100-item, one-hour objective mid-semester
examination and the 200-item, two-hour objective final examination.
The quizzes were graded by the instructor and returned later for
discussion. The students of both groups were matched on cumulative
university grade point average and American Council Psychological
Test score.

Thirty-three matched pairs of students were analyzed by z tests.
On the ACPT, a z test showed the two groups to be comparable
(two-tailed P = .90); similarly, the GPA means were comparable
(two-tailed P = .91). On the mid-semester examination, C and E did
not differ significantly (two-tailed P = .60); nor did they on the
final examination (two-tailed P = .41). Similar insignificant
differences were obtained when the eleven highest and eleven lowest
students in both groups were analyzed separately. "There is no
evidence here, and little in other studies, to support the common
belief among instructors that written tests as commonly used
motivate learning or increase total achievement in college classes. If
they do add something to the effectiveness of our teaching, this
fact remains to be demonstrated with other measures than those used
in studies reported to the present time." (Noll, 1939, p. 357). The
reader should take note, however, that the feedback mode has been
confounded with the frequency mode.

Ross and Henry (1939) conducted a study in a general psychology
course at the university level. Both E and C received the same
pretest, mid-semester examination, and final examination. E also was
given weekly objective tests. On the final examination, E achieved
significantly higher than C. However, in an identical study in
educational psychology, opposite significant results were obtained.

Sumner and Brooker (1944) dealt with daily testing in a general
psychology course at the college level. The course records of
200 students were analyzed. After two or three weeks from the start
of the semester, forty daily tests of matching type were given to the
class; about 25 items were on each daily test. No control group was
used; the investigators were interested only in the predictive value
of the average of the first five quizzes in relation to the average
for all forty quizzes. For the whole group, r = .82 ± .0156;
the z test for the difference in average percentages of the first
five quizzes and all forty quizzes is statistically significant
at the .002 level (two-tailed). The z test, however, appears to be
invalid; related samples can be compared, but the measures themselves
must be taken independently of one another. Although the same criticism
could be leveled at the r calculation, the purpose for which it
was computed (prediction) seems to make the method valid here. The
investigators conclude, "The standing of the students relative to one
another at the end of the first 5 tests will be approximately the
same as their standing relative to one another at the end of the 40
tests." (Sumner and Brooker, 1944, pp. 323-324).
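
The .0156 reported with r = .82 agrees with the classical "probable
error" of a correlation coefficient for a sample of 200. The sketch
below assumes that this is the quantity the investigators reported
(the source does not name it); the function name is hypothetical.

```python
import math


def probable_error_of_r(r, n):
    """Classical probable error of a Pearson r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


# r = .82 with the 200 course records analyzed in the study.
print(round(probable_error_of_r(0.82, 200), 4))  # 0.0156
```

Because the first five quizzes are themselves part of the forty, the
two averages share content, so the coefficient is partly a part-whole
correlation. This does not affect its use for prediction, which is the
only use the reviewer accepts for it.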

As a further analysis, the students were divided (on paper)
into a "Hi-Lo" dichotomy on the basis of their averages for the first
five daily tests; the "Hi" category consisted of the 114 students
with a five-test average of 70% or above, while the other 86 students
formed the "Lo" group. The same type of Pearson product-moment
correlation coefficient as for the entire group above was
computed separately for the "Lo" and "Hi" groups: r = .60
and r = .68, respectively. Hence, the authors conclude that the
students having an initially low average gain much more than their
high-standing counterparts throughout the semester.

Fitch, Drucker, and Norton (1951) studied the effect of weekly
quizzes on achievement in an introductory college government course.
The intact control class of 97 students had only the usual monthly
quizzes, while the intact experimental class of 198 students had not
only the usual monthly quizzes but also the weekly ones. Both
classes met three days a week with the same professor, and both
sections had access to voluntary discussion sections outside of class.
During the third class meeting every week, the last half hour for
both classes was set aside for questions and discussion; however,
ten minutes were taken away from this time for E to have its
weekly quiz. Four one-hour monthly tests were given to both groups.
The weekly quizzes that E received were on textbook assignments,
not on the lecture material. The criterion variable was the grades
of the five hour-long tests given to both groups (Y). Because of
the intact classes, the covariate (X) was the preceding semester's
government course grade (the government course was a two-course
sequence). The regression of Y on X was found to be linear.

For purposes of analysis, the degree of voluntary discussion
group attendance was broken down into four frequency classes. A
two-way analysis of covariance was used: methods (two levels) by
degree of discussion group attendance (four levels). To achieve
equal total numbers for E and C, 91 students were randomly
selected from the original E pool so as to use all of the 91 C
students for whom complete discussion attendance records were
available. However, within the total equalized E and C groups,
the corresponding frequencies for a certain level of discussion
frequency were disproportionate; the analysis of covariance was
adjusted accordingly. For the total sample, Y was found to be almost
normally distributed. Homogeneity of residual variance was
found to be satisfied by Hartley's M-test. Using the method of
disproportionate cell frequencies (nonorthogonal case), Fitch et al.
found E > C (.01 < P < .05), monotonically increasing achievement
in relation to increasing voluntary discussion session attendance
(.05 < P < .10), and methods by discussion interaction insignificant
(.50 < P < .75). The investigators conclude, then, that regardless
of discussion group attendance, weekly quizzes significantly aid
achievement in college government courses.

One should note that the main effect (weekly quizzes) is confounded
with other variables in the experimental group (grading
practices). The reviewer believes that confounding of frequency
of testing with grading procedures exists because the students
were told that the quizzes would not be counted in their final
grades. The question raised by the reviewer is how one can have
a test without a grade. Further, for reasons which elude the
reviewer, the investigators first proceed with separate analyses of
covariance for methods and for discussion group attendance. Finally,
they present the two-way analysis of covariance (methods and
discussion group attendance). What the two one-way analyses of
covariance were supposed to accomplish, the reviewer does not know.
Without the measurement of the interaction between methods and
discussion group attendance, any discussion of either main effect
by itself is suspect. Hence, only the two-way analysis should have
been used to begin with.

Guetzkow, Kelly, and McKeachie (1954)--doing curricular research
rather than more basic research--dealt with three methods of teaching
an elementary general psychology course in a university. Twenty-four
intact classes were used (N = 25 to 35 for a single class).
For the first class meeting of each week, the twenty-four classes were
pooled into three large lecture sections (N = 250 to 300 for a
lecture section); each large lecture section received the same
one-hour lecture. For the other two one-hour class meetings, the
students reverted to their original grouping among the 24 classes.



Eight teaching fellows taught three classes each; each fellow
had to use all three methods. However, each class was exposed to
only one method. It was during the small-class meetings (last two
sessions every week) that the experimental manipulations took place.
In the first method (M1), a short completion quiz was given at each
of the two small-class meetings; the instructor was told to lead
and structure all discussions. In the second method (M2), a
discussion atmosphere was created; the students structured the
meetings. Other than course examinations, essay tests were given
in M2 once every other week. In the third method (M3), independent
study was stressed; the instructors were there only for
consultation. No quizzes were given. Extra reference material was
provided for making the independent study more flexible.

The reader can see why the reviewer considers this type of
experiment as "broad curricular research." The design of the
experiment was dictated by purely on-the-job considerations without
regard to confounding. If any differences arise, one will never
know what particular aspect of the superior method caused it to be
so; the results have to be taken at face value, realizing their
limitations.

Each of the eight instructors had responsibility for each of
the three methods; thus, hopefully, each instructor would not be
prone to favor one method over the other just from sole usage of
that method. "The time sequence of methods was varied from
instructor to instructor so that no method would consistently be
used for the instructors' first sections to be taught during the
week. All permutations of the order of methods were used."
(Guetzkow et al., 1954, p. 198). Students were changed to different
groups at the start of the experiment so that each class ". . . had
an appreciable number of women, veterans, etc., and mean score
intelligence and grade point average were perfectly matched."
(Guetzkow et al., 1954, p. 199). The students were given a
rationalization as to why the different methods were being used.

Three major examinations were given to all classes throughout
the course: a major unit test (covering the first four weeks), the
mid-semester examination (after eight weeks), and (T1) the final
examination. Other posttest criteria were also used: (T2) the
United States Armed Forces Institute Examination in elementary
psychology, (T3) a test made by the eight instructors of common
misconceptions in psychology (corrected split-half reliability of
.73), (T4) Duvall's Conceptions of Parenthood Test (corrected
split-half reliability of .80), (T5) McCandless' Scientific and
Analytic Attitude toward Human Behavior, (T6) an
attitude-toward-psychology test made by the eight instructors
(corrected split-half reliability of .75), (T7) the number of students
planning to concentrate in psychology, and (T8) the number of advanced
psychology courses students want to take.

Although separate one-way analyses of variance (better yet,
analyses of multiple covariance) would have been more appropriate
for this three-method study, the investigators supply simple t tests
of the difference between the highest and lowest means. No significant
differences were found for four of the criteria (T[illegible],
T[illegible], T[illegible], and T[illegible]). One would assume
from the investigators' description that for each pair of methods
being compared, df = N1 + N2 - 2 = 14 (using intact classes as
the unit of analysis). Then, on T[illegible], M[illegible] >
M[illegible] (.01 < two-tailed P < [illegible]); on T[illegible],
M[illegible] > M3 (.05 < two-tailed P < [illegible]); on T7, M1 > M2
(.02 < two-tailed P < .05); and on T8, M1 > M[illegible] (.01 <
two-tailed P < [illegible]). Despite the deliberate attempt of the
investigators to make the three methods as different as possible,
Guetzkow et al. conclude that the methods are not impressively
different.

The documentation of analyses is very poor. Although only
t tests of the differences between methods are provided, the
investigators make a statement that implies that they had made some
kind of analysis of variance or covariance: "Not only are the
differences between instructors not statistically significant but
there was no significant interaction between instructor and
method." (Guetzkow et al., 1954, p. 202). However, the supporting
data are not provided.

Maize (1954) investigated two methods of teaching composition
in remedial English classes at the university level. Themes written
in class were treated as quizzes. One group wrote forty themes in
class throughout the semester; each day's themes were criticized
and discussed in class. A second group used a combination of workbook
drill (English usage) and the writing of fourteen themes; however,
the themes were corrected outside class by the instructor.
The first group achieved significantly higher on a test of English
usage. However, here it must be noted that method of correction and
method of feedback were totally confounded with frequency of quizzing
(writing the themes). This situation need not have been the case
if proper design had been exercised. No doubt the inequity of time
allotment with respect to the use of workbook drill could not be
avoided; the testing time has to be gotten somewhere. However, there
is no pedagogical logic or necessity for confounding the results
with modes of correction and feedback.

Mudgett (1956) studied the effect of daily, weekly, and monthly
testing on achievement in engineering drawing. The course was
meant for first-year technical students (engineering, mining,
metallurgy, mathematics, and so on). Eight intact classes were used in
this experiment; they were randomly chosen from a total of 21
sections of the course. Then the selected classes were assigned at
random to three different testing programs. Randomization had also
been used at the registration period, where students had been
assigned to the original 21 classes at random. Further, the four
instructors used in this experiment were assigned to the sections
at random. All groups had testing time taken out of the laboratory
drawing time, not from the lecture periods. Each group met for
eight periods a week: two hours for lectures and six hours for
laboratory drawing work. For purposes of analysis, 184 students
formed the final sample. The design appears to be a type of
balanced, incomplete-block design:

                             INSTRUCTORS

TIME         A               B               C               D

 8:30   DAILY TESTS     MONTHLY TESTS   MONTHLY TESTS   WEEKLY TESTS
10:30   MONTHLY TESTS   DAILY TESTS     WEEKLY TESTS    MONTHLY TESTS


(Mudgett, 1956, p. 58). Attempts to keep teaching procedures relatively
uniform were made by having all instructors follow a course
outline and attend weekly meetings with the investigator.

Two classes (E1) received a ten-minute quiz at the start of each
class period. The students corrected the papers in class that same
day, and the results were discussed at that time. Two classes (E2)
received a thirty-minute test at the end of each week. The tests
were corrected by the instructors over the weekend and returned the
following class meeting, where discussion of the tests was provided.
Four classes (E3) received a major unit test at the end of the fourth
and ninth weeks of the semester, as well as the final examination.
No other tests were given. "These unit tests were machine scored and
only the scores were given to the students; hence, the four classes
in the Monthly Test Group knew their class standings but had no
other specific information to be used to adjust study techniques."
(Mudgett, 1956, p. 4).

More specifically, E1 (two classes) received 34 ten-minute
quizzes; E2 (two classes) received 8 thirty-minute tests. E3
(four classes) had only the four major tests. The four major tests
formed the criterion measures for this experiment: Engineering
Drawing Test Form A (T1); Engineering Drawing Test Form B (T2);
Engineering Drawing Performance Test (T3); and Engineering Drawing
Theory Test (T4). T1 was given as a pretest and again at the end of
eight weeks. T2 was given at the close of the first four weeks of
instruction. T3 and T4 were given as immediate posttests. The 50-item,
multiple-choice T[illegible] had reliabilities of .86, .81, and .80
for three different samples when Hoyt's analysis of variance technique
was used. The 40-item, multiple-choice T[illegible] had a corrected
Spearman-Brown reliability of .84 and a Hoyt reliability of .81. The
50-item, multiple-choice T[illegible] had a Hoyt reliability of .65.
The 200-point, combination-type T[illegible] had a Kuder-Richardson 21
coefficient of .96.
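
For readers unfamiliar with the Kuder-Richardson formula 21 named
above, the following sketch shows the standard computation from the
number of items, the mean score, and the score variance. It assumes
equally weighted, right-wrong items; whether that assumption holds for
Mudgett's 200-point, combination-type test is not stated in the
source. The function name and sample figures are hypothetical.

```python
def kr21(n_items, mean, variance):
    """Kuder-Richardson formula 21.

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * s^2))
    where k is the number of items, M the mean total score,
    and s^2 the variance of total scores.
    """
    if n_items < 2 or variance <= 0:
        raise ValueError("need at least two items and a positive variance")
    return (n_items / (n_items - 1)) * (
        1.0 - mean * (n_items - mean) / (n_items * variance)
    )


# Hypothetical 50-item test with mean 30 and variance 64.
print(round(kr21(50, 30.0, 64.0), 3))
```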

Analysis of covariance was used for evaluating the criterion
posttests of T1, T2, T3, and T4; T1 (as a pretest) was used as the
covariate for all criterion tests, and T2 was also used as a separate
covariate in the case of T4. The covariates were chosen on
the basis of their significant correlation with the posttest
criterion in question. For each of the five covariance analyses (two
for T4), the Welch-Nayer L1 test for homogeneity of residual variance
was run, as well as the usual analysis of variance test for
homogeneity of regression coefficients. All covariance analyses were
run with the usual between groups and within groups breakdown (seven
df for between groups and 175 df for within groups). On T2, the
adjusted means were insignificantly different (.05 < P < .10).

On T1 used as a posttest, the adjusted means were significantly
different (P < .005). Hence, for T1, multiple comparisons were
used next (one df for between groups and 175 df for within groups).
E1 > E3 (.10 < P < .25). E2 > E3 (.25 < P). However, the comparison
of E1 versus E2 was not given. Using a more detailed factorial
analysis of covariance, Mudgett states that the interaction of
instructors by methods was not significant at the .05 level. The
original significance of the overall F ratio was caused by the
significant comparison of instructors (A+B) versus instructors (C+D).

On T3, the adjusted means are significantly different (P < .005).
Therefore, multiple comparisons are again used. E1 > E3
(P < .005). The comparison of E2 with E3 was not significant
(.10 < P < .25). However, the comparison of
E1 versus E2 was not given. The interaction of methods by instructors
was not significant at the .05 level.

On T4 (still using the T1 pretest as the covariate, as in the
analyses of covariance of T1, T2, and T3), the overall adjusted means
are significantly different (.01 < P < .025). However, this
significance was caused by the significant comparison between
instructors (A+B) versus instructors (C+D), and not by any multiple
comparisons between methods. On the other hand, using high school rank
as the covariate instead of pretest T1, the overall adjusted means are
again significantly different (.025 < P < .05). However, the
investigator does not pursue this analysis any further. He concludes,
". . . there is no evidence to support the belief that students in
engineering drawing who are given tests similar to those used in
this investigation as frequently as once a week will learn any more
effectively than the students who are given tests once a month."
(Mudgett, 1956, p. 166).

Fattu (1957) reports briefly on a program of frequent informal
achievement testing in an elementary engineering course for Navy
enlisted men. The tests were directed more toward performance
skills in the shop than toward theoretical learning in the classroom.
However, the frequent performance tests were directly
related to the engineering theory the men had learned in the
classroom. The investigator compared the final classroom examination
scores of classes that had passed through the program of
frequent performance testing and earlier classes that had not.
Although he does not provide tests of significance, Fattu presents
the differences in final classroom test means for two schools in which
the improved testing programs were introduced: E1 - C1 = 35.2 and
E2 - C2 = 40.2. It appears, then, that frequent testing in areas
related to, but not a direct part of, classroom learning can increase
classroom achievement.

Standlee and Popham (1960) studied frequent testing in an introductory
educational psychology course for undergraduates. Four intact
sections with a total of 104 students were used. All sections were
taught by the same instructor. Section A had weekly quizzes of a
twenty-item, true-false type; the instructor-corrected quizzes were
counted toward the students' final mark and were returned the next
class period. Section B had the same weekly quizzes; the students
corrected their own papers, and the grades did not count in the
final mark. Section C had the same quiz material presented to them
in a reading fashion by the instructor; he then answered the
questions verbally himself. Section D had no quizzes in any form. For
each section, the investigators postulated theoretical bases in
various combinations of extrinsic motivation, knowledge of results,
psychological structuring, and enforced activity with test subject
matter. Sections A, B, and C each received a total of thirteen
quizzes throughout the semester.

A 100-item, multiple-choice pretest was given to all four
sections at the start of the study.

     At midsemester, a different 100-item multiple-
     choice type examination was administered to
     all subjects. Fifty of the test items were
     common to the pretest; 50 items were new. At
     the end of the semester, a 150-item multiple-
     choice type examination was given to all
     subjects. The test items included the other
     50 items of the pretest, the 50 new items
     of the midsemester examination, and 50 new
     items. (Standlee and Popham, 1960, p. 323).

Analysis of covariance with the pretest scores as covariate was
used. On the mid-semester examination, the adjusted means of the
four groups were significantly different only in a marginal sense:
Standlee and Popham claim P < .05; however, in fact, .05 < P < .10.
Then, using multiple comparisons between specific adjusted means,
only (A-D) was significant: .02 < P < .05.

On the final examination, the adjusted means of the four groups
were not significantly different: .10 < P < .25. The investigators
admit their design is confounded: frequency by correction by grading.
The investigators conclude, ". . . the use of quizzes will tend to
increase students' achievement of subject matter early in a
lecture-discussion type of course, . . . but . . . the significance of
the increase in achievement is lost by the end of the course. . . ."
(Standlee and Popham, 1960, pp. 324-325).

Selakovich (1962) dealt with frequent testing in an introductory
college course in American Government. Two classes were randomly
created so that 19 students were in each. The students were then
matched on the Cooperative American Government Test (Form Y). Both
E and C were given three instructor-made, hour-long examinations
during the course and Form X of the Cooperative American Government
Test as a final examination. In addition, E was given twelve
unannounced quizzes throughout the semester. Hence, one will not be
able to determine whether or not any significant or insignificant
differences between E and C are caused by frequent testing or by
the effect of nonannouncement; the design is confounded.

The investigator used a t test for related samples. For the
difference in cumulative means for the three major hour tests, C > E
(two-tailed P > .20). For the difference in means on the standardized
posttest, E > C ([illegible] < two-tailed P < .40). This is a strange
reversal of results. Further, comparing the parallel-form pre- and
posttests, the mean gain from pretest to posttest was [illegible] in
one group and 3.84 in the other; apparently, as
measured by the standardized tests, the students were not affected
very much by the intervening course. It would be interesting to see
whether or not, using Form Y of the CAGT as a covariate, the results
for hourly tests and for the standardized posttest would be
much different if an analysis of covariance is used.
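
The t test for related samples used on the matched pairs can be
sketched as follows. This is the textbook computation on pair
differences, not a reproduction of Selakovich's working; the function
name and the scores are hypothetical.

```python
from math import sqrt


def paired_t(e_scores, c_scores):
    """t statistic and degrees of freedom for matched pairs.

    Each position in e_scores is paired with the same position in c_scores.
    """
    diffs = [e - c for e, c in zip(e_scores, c_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var_d == 0:
        raise ValueError("all pair differences are identical")
    return mean_d / sqrt(var_d / n), n - 1


# Hypothetical scores for five matched pairs.
t, df = paired_t([11, 12, 13, 14, 15], [10, 10, 10, 10, 10])
print(round(t, 3), df)
```

With the 19 matched pairs of this study, the statistic would be
referred to a t distribution with 18 degrees of freedom.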

Laidlaw (1963) studied weekly and monthly testing procedures
during two semesters of a general psychology course. Each of three
instructors taught a pair of classes; which of each pair of classes
was to be the E group was randomly determined. Each of the six
classes met for three one-hour sessions each week for 16 weeks a
semester. Every third class hour, a 15-minute, 20-item,
multiple-choice test was given to the three E groups. Results of these
quizzes were made known at the next meeting, but no discussion of
questions was permitted. The three C groups were tested every
fourth week with a one-hour, 80-item, multiple-choice test. The
same feedback procedures as with E were used. Each of these
monthly tests had 60 items in common with the equivalent weekly
tests. The experiment was continued during the second semester to
obtain measures of delayed achievement. Out of a total of 151
students, only 120 became the pool for analysis because of incomplete
pre-experiment records. The three E groups had an original
total of 87 students (but only 69 workable records), while the
three C groups had an original total of 64 students (but only 51
workable records). In going from the first semester to the second
semester, attrition brought the total working number of students
for E down to 53 and for C down to 40.

Before the experiment began, chi-square frequency analyses were run on
certain "face-sheet" information: sex by methods (.50 < two-tailed
P < .70), class (freshman, sophomore, junior, or senior) by methods
(.50 < two-tailed P < .70), and curriculum (liberal arts, business,
or engineering) by methods (.30 < two-tailed P < .50). Thus, the two
sets of classes did not differ significantly on various categorizing
criteria. For immediate achievement from the first semester, Laidlaw's
covariate was a 120-item objective test made by him on both verbal
ability and knowledge of psychology (corrected split-half reliability
of .94). The final examination for the first semester was a 150-item
objective test made up by Laidlaw (corrected split-half reliability of
.96).

For the immediate achievement analyses of the first semester,
the investigator states,

     The data for the three course sections in each
     treatment group were treated as one sample. The
     combination of data was justified by the
     determination that the variances on the covariants
     for sections within treatment groups were
     homogeneous, and because there was no interaction
     between instructors and conditions of testing.
     The variances of the treatment groups on the
     covariants were homogeneous. The simple
     analysis of covariance technique was used
     to test each hypothesis. (Laidlaw, 1963,
     p. 26).

On the first-semester final examination, methods were
insignificant (F < .5).



In connection with the analysis of immediate achievement from
the first semester, one might ask why the investigator went ahead
with the one-way analyses after he apparently went through all the
labor of the more informative two-way analysis (methods by classes--or,
what is the same, instructors). This procedure appears to be
sophisticated "data snooping." On the other hand, perhaps such
a procedure is desirable in that pooling across instructors (or
classes), after one knows such variation is negligible, results
in a more refined error term when one goes to a one-way analysis
with the student as the unit of analysis.

Two 60-item postsemester tests were developed: 30 items came
from the first semester final examination and 30 items from the
second semester. The first delayed posttest had a corrected
split-half reliability of .86, and the second, .84. Five weeks after
the end of the first semester, the first delayed posttest was
given. After another five weeks, the second delayed posttest
was given. Strangely enough, Laidlaw used the first semester's
immediate posttest as the covariate for both delayed posttest
analyses. On the first delayed posttest, methods were insignificant
(.10 < P < .25). On the second delayed posttest, methods
came out similarly (.10 < P < .25).

The investigator concludes, "The study demonstrated that the 
belief among college teachers and those who write on educational 
methodology that frequent testing is a useful means for controlling 
student learning behavior is not well founded." [underlining 
inserted by reviewer] (Laidlaw, 1963, p. 46). 

The reviewer is somewhat skeptical of the procedure of using 
the first semester's posttest as the covariate for postsemester 
retention. This procedure appears to be highly suspect, because 
such a covariate is not independent of the first semester's 
treatment effects. In fact, such a covariance procedure makes it 
harder to obtain differences on delayed retention, since, in effect, 
one is obliterating the very differences that he is interested in 
to start with. The covariates in both analyses (immediate and 
delayed) should have been the same: the very first pretest, which is 
unaffected by treatment differences. It is true that one wants the 
covariate to be correlated with the criterion measure, but one also 
wants the covariate to be independent of the very treatment effects 
he is trying to measure. 
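
The reviewer's objection can be illustrated with a small numerical
sketch. The scores below are invented and the function names are
hypothetical; nothing here reproduces Laidlaw's data. The usual
covariance-adjusted difference is taken as (mean E - mean C) on the
criterion minus the pooled within-group slope times (mean E - mean C)
on the covariate. When the covariate is itself a posttest that already
carries the treatment effect, the adjustment subtracts that effect out.

```python
def mean(values):
    return sum(values) / len(values)


def pooled_within_slope(groups):
    """Pooled within-group slope of criterion y on covariate x.

    groups: list of (x_scores, y_scores) pairs, one pair per group.
    """
    sxy = sxx = 0.0
    for x, y in groups:
        mx, my = mean(x), mean(y)
        sxy += sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx += sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def adjusted_difference(e, c):
    """Covariance-adjusted E minus C difference on the criterion."""
    b = pooled_within_slope([e, c])
    (ex, ey), (cx, cy) = e, c
    return (mean(ey) - mean(cy)) - b * (mean(ex) - mean(cx))


# Invented scores: the groups are equal on the pretest, and the E
# treatment adds 5 points that persist to the delayed posttest.
pretest = [40, 50, 60, 70]
e_immediate = [p + 5 for p in pretest]  # E immediate posttest
c_immediate = list(pretest)             # C immediate posttest
e_delayed = [p + 5 for p in pretest]    # E delayed posttest
c_delayed = list(pretest)               # C delayed posttest

# Covariate unaffected by treatment (pretest): the effect survives.
with_pretest = adjusted_difference((pretest, e_delayed),
                                   (pretest, c_delayed))
# Covariate affected by treatment (immediate posttest): the effect vanishes.
with_posttest = adjusted_difference((e_immediate, e_delayed),
                                    (c_immediate, c_delayed))
```

With these invented scores the adjusted difference is the full 5
points when the pretest is the covariate and zero when the immediate
posttest is the covariate, which is the obliteration the reviewer
describes.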

The above study by Laidlaw concludes the first review topic: 
frequency of testing. Apparently, only one study was ever done in 
this area at the elementary school level. One study touched upon 
junior high school, while six studies concentrated on the senior 
high school. From 1929 to 1963, nineteen studies were conducted at 
the post-high school level. The results have been inconclusive. 
Poor experimental design and inadequate statistics have produced 
largely meaningless results. The second major topic of this review 
to follow (test grades as an incentive to further achievement) is 
even less well researched. 

REVIEW TOPIC TWO: INFORMAL ACHIEVEMENT 
TEST GRADES IN RELATION TO TESTING AS A LEARNING DEVICE 

The second topic in this review concerns the use of informal 
achievement test grades as extrinsic motivation: the idea that 
students' knowing that a test grade will be received, one that 
might affect their future success, might push them on to greater 
heights in achievement. This topic is closely related to the 
student's concept of "test": how meaningful it is as an incentive. 
As already stated in the introduction, this second review topic 
will deal only with test grades as an experimental variable and not 
with other types of grades and extrinsic motivation (for example, 
report card term grades, gold stars, awards, citations, and so on). 
Since this second review topic is of major concern to the 
reviewer's dissertation experiment, references will be examined and 
criticized in detail. 

Nonresearch References: Arguing against the use of test grades 
as incentives, Odell (1928, p. 17) says, "If a teacher is skillful 
enough to motivate the work of her pupils to a sufficient degree 
by other means than checking up on their work and appealing to 
their desire for high marks, less testing will be needed than if 
it must be employed for that purpose." [underlining inserted by 
reviewer] 

Another negative criticism of test grades as incentives is 
given by Kneeland and Bernard (1953, p. 499): "But take away the 
grading aspect, the positiveness of rights and wrongs--and what is 
onerous with a grade is suddenly challenging--a game, a matching of 
wits, a sourcespring of earnest discussion, in which the objective 
is not defense but learning the truth." The authors also assume that 
the word "test," even though test grades might not be given, is 
meaningful enough to students as an incentive in itself (p. 500): 
"Objective tests, when not graded, do much to arouse interest and 
to give variety." Whether an ungraded test is itself an incentive 
is an interesting problem that has yet to be solved. 

Experimental Studies: In contrast with the first review topic, 
frequency of testing, no breakdown will be made on the basis of 
educational level because of the dearth of material. 

Panlasigui and Knight (1930) found that students given 
arithmetic drill material in fourth grade without external 
motivation did more poorly than those given both the drill material 
and an external incentive. The external motivation used was a 
progress chart on the wall based on the students' drill-test grades. 
A total of 56 intact classes from 10 city school systems were used; 
the cities were located in various states in the West and Midwest. 
From each school, at least two E groups and at least two C groups 
were formed. 

Each E and C class received the same weekly drill material: 15 
problems of mixed type (as compared to "isolated" drill material of 
only one type). Only whole numbers were used in these previously 
learned arithmetic skills. The E and C classes received the drill 
materials only one day a week for 20 weeks. The E treatment 
consisted of the use of individual progress charts and class 
progress charts. A pretest and posttest were given to all 56 
classes. 

Out of a total for all E and C classes of 938 students left 
at the end of the experiment, 358 matched pairs were possible for 
analysis; the matching was on the basis of pretest scores. In 
both E and C analytical groupings, sex was evenly divided. Simple 
z statistics were used throughout the analysis. On the posttest, 
E > C (two-tailed P [value illegible]). The investigators claim 
(p. [illegible]): "A just interpretation of the omitted data 
warrants the statement that the gains were a bit slow in appearing, 
but that with increasing sureness the Experimental Group responded 
more successfully as time went on. In other words, the novelty of 
the progress chart idea did not stimulate a spurt of effort which 
then tended to die out, but, rather, an opposite effect appeared." 

Breaking the analysis down finer, on sex it was found that the 
difference between the sexes within E was not significant (no 
definite probability value was given). Then, dividing the 358 pairs 
into four quarters of ability on the basis of pretest performance, 
for the top quarter, E > C (two-tailed P < .001). For the lowest 
quarter, C > E (two-tailed P = .86). Although the investigators did 
not give the exact results for the middle two quarters, it was said 
that the second highest quarter was still significantly in favor of 
E. Panlasigui and Knight (1930, p. 615) conclude, "The beneficial 
effect of awareness of success, then, was substantially in direct 
proportion to the amount of success available for motivation." 

Fay (1937) studied the effects of two test grading systems 
upon undergraduate juniors and sophomores in introductory psychology. 
At the start of the experiment, 196 subjects were available. The 
large class was broken up into an E group of 89 and a C group of 
[number illegible]. E used the "open" test marking system (to be 
distinguished from the less related semester marking system); each 
student could find out his monthly test grade and final examination 
grade in terms of A, B, C, D, and F. The C group used the "closed" 
test marking system; after each test, they could find out their 
test grade only in terms of S (satisfactory), D, and F. 

Both groups were taught by the lecture and quiz method. Two 
classes a week were devoted to lectures; the lectures were handled 
equally by two instructors. In a somewhat contrived manner, once 
every week the two groups were broken down into eight quiz and 
discussion sections. Once every fourth week, a 125- or 150-item 
objective test was given. The immediate posttest consisted of a 
400-item objective test. 

For purposes of analysis, the groups were matched on both 
percentile rank on the American Council on Education Psychological 
Examination and score on the first monthly test in the course. Six 
separate analyses were carried out: four monthly tests, the final 
examination, and the difference between the first monthly test and 
the final examination. Within each of the six analyses, ability 
categories of A, B, C+, and C were formed on the basis of the first 
test given in the course. Unfortunately, no information at all was 
given about the categorizing test; one hopes that it was different 
from the four monthly tests used in the analyses, since one needs a 
control variable to be independent of the criterion measures. 
Throughout the six analyses, simple z statistics are used; separate 
two-way analyses of variance would have been more appropriate. The 
sixth analysis (difference between first monthly test and final 
examination) will not be considered in the presentation of results 
in this review, since the calculations appear to be invalid. Between 
the two tests, one has reflections in his difference scores of 
different difficulty levels and different content materials. This 
criticism is especially applicable to standard raw scores (Hull's 
formula: X = K + SX) used in all analyses; even if normalized scores 
had been used, the reviewer would still doubt the validity of the 
calculations. 

Because of the university's academic attrition policy, after 
eight weeks the E and C groups consisted only of students having an 
average better than D. Hence, all analyses throughout the whole 
experiment were conducted only on such subjects. For students of A 
ability, E > C for each of the four monthly tests; only the third 
test approached significance (two-tailed P = .11). On the final 
examination, students of A ability yielded E > C (two-tailed 
P = .002). For students of B ability, on all four monthly tests, 
C > E; only the second, third, and fourth monthly tests are worth 
discussing: two-tailed P = [illegible], [illegible], and .02, 
respectively. On the final examination, students of B ability 
yielded C > E (two-tailed P = .008). For students of C+ ability, 
contradictory results were obtained: E > C on the first and third 
monthly tests, while C > E on the second and fourth monthly tests 
(all results highly insignificant). On the final examination, 
students of C+ ability yielded C > E (again, insignificant). For 
students of C ability, E > C on all four monthly tests. Only the 
first and fourth monthly tests are worth mentioning: two-tailed 
P = [illegible] and .02, respectively. On the final examination, 
students of C ability yielded E > C (not significant). 

Fay (1937, p. 551) rationalizes his rather contradictory results: 
"In other words, if students securing an A on the first test knew 
their marks, they apparently put forth extra effort to retain their 
positions. The C students in the experimental group attempted to 
improve their standing. The B and C+ students, on the other hand, 
were apparently satisfied, did not unduly exert themselves, and 
consequently declined relative to the rest of the class." 

Bostrum, Vlandis, and Rosenbaum (1961) claim that experimental 
evidence in realistic classroom situations is rare on the problem of 
providing extrinsic motivation (test grades, gold stars, prizes, and 
so on). The investigators studied changes in attitudes when 
reinforced by randomly assigned test grades. A total of 228 
undergraduate students in communication skills classes were the 
subjects. All subjects were given an attitude questionnaire 
consisting of four 10-item scales (federal aid to education, 
legalized gambling, capital punishment, and socialized medicine). 
The scales on federal aid to education and capital punishment were 
not used in the analyses because of nonsatisfaction of measurement 
assumptions. About six weeks after the attitude questionnaire was 
given, all subjects wrote a half-hour essay on either legalized 
gambling or socialized medicine. Each subject was assigned the topic 
on which he had evidenced the strongest attitude on the 
questionnaire six weeks earlier. Marks of A, D, or No Grade were 
randomly assigned to the essays. The next class period the essays 
with the randomly assigned grades were returned to the subjects; 
during that period, each subject was given both attitude scales 
(legalized gambling and socialized medicine). Bostrum et al. (1961, 
p. 113) say, "Finally, subjects were asked to indicate their 
satisfaction with the essays." 

The final analyzable sample consisted of 127 students. The 
analysis for mean attitude change (from pretest questionnaire to 
posttest questionnaire) yielded heterogeneity of variance; hence, 
the investigators used Cochran's approximate t method for testing 
the significance of the difference between means of change scores: 
A > D (P < .01); A > No (P < .05); and D > No (P > .10). This 
analysis was only for the one posttest attitude scale (out of 
two different posttest scales) related to the essay written by 
that particular student. It should be noted that Cochran's 
procedure was inappropriate here, since the investigators had three 
samples, not two; therefore each of the three comparisons fails 
to take into account overlapping variance. An analysis of variance 
should have been used in connection with a stabilizing 
transformation of scores. 
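
The omnibus test the reviewer recommends can be sketched as a one-way
analysis of variance over the three grade conditions. The change
scores below are invented for illustration and the function name is
hypothetical; the sketch omits both the stabilizing transformation
and the lookup of a probability value for F.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups sum of squares: scores around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, group_means))
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Invented attitude-change scores for the three grade conditions.
changes = {"A": [4, 5, 6], "D": [1, 2, 3], "No Grade": [1, 2, 3]}
f_ratio, df_b, df_w = one_way_anova_f(list(changes.values()))
```

A single F ratio of this kind uses one pooled error term for all three
groups, which is the point of the reviewer's objection to three
separate two-sample comparisons.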

Continuing with the analysis, Bostrum et al. (1961, pp. 113-114) 
say, "An analysis of mean change in relation to initial position 
indicates that those who had initially assumed a favorable position 
on each of the issues . . . changed significantly more ( . . . p 
< .01) than those who were unfavorable." Further, Bostrum et al. 
(1961, p. 114) claim, "By comparing the change scores of subjects 
who had written an essay on a particular topic with those who had 
not written on that topic . . . [it was found that] This difference 
is significant (. . . p < .01) suggesting that the writing of an 
essay, independent of grade received, produced change in attitude." 

Satisfaction with grade received was also investigated. A 
chi-square analysis was performed on the distribution of responses 
(satisfied and not satisfied) for each of the essay grade levels 
(A, D, and No Grade). It was found that P < .001. Finally, it was 
concluded (p. 114), "The results suggest support for the hypothesis 
that a 'good' grade serves to reinforce the behavior for which 
it has been administered." 

Hawk and DeRidder (1963) also claim that actual experimental 
evidence is rare regarding incentives in realistic learning 
situations. The investigators used two grading procedures with 
college students. In one section, students' course grades were 
determined in the usual way: by course test performance and a term 
project. In the other section, the students' course grades were 
determined by their cumulative grade-point average computed up to 
the start of the course. 

Four sections of educational psychology at the university level 
comprised a total of 118 subjects; most were sophomores and juniors. 
Each of two instructors taught two sections. Hawk and DeRidder (1963, 
p. 548) say, "Each section met three times each week for a 50-minute 
period, Monday, Wednesday and Friday." The assignment of teachers 
to sections had already been determined by the administration, 
probably in a nonrandom manner. Also, it appears that a nonrandom 
procedure was used to select the E section from the two sections of 
each instructor. To combat the contaminating effect that 
instructors' knowledge of differences in methods might have on the 
effectiveness of teaching, the students were told of the experiment 
but a deliberate attempt was made to keep the instructors ignorant 
of the procedures. It seems to the reviewer that the subjects should 
also have been kept ignorant of the fact that an experiment was 
under way. Two 60-item, multiple-choice unit tests and a 100-item, 
multiple-choice final examination formed the criteria. 

Hawk and DeRidder (1963, p. 550) say, "Mean scores for the two 
groups taught by each instructor were tested for possible 
significant differences, but all differences between instructors 
were so slight as to be insignificant." However, no details are 
given as to exact results or to methods of calculation and how they 
tie in with later statistics. Then, using t tests with 116 df, on 
the first unit test, with E representing the predetermined grade 
group, C > E (.01 < two-tailed P < .02). On the second unit test, 
C > E (.01 < two-tailed P < [illegible]). On the final examination, 
C > E (two-tailed P < .001). On the case study, [result illegible] 
(.05 < two-tailed P < [illegible]). Hawk and DeRidder (1963, 
p. 550) conclude, "Such findings raise questions about the validity 
of arguments of many educators that grades destroy motivation." 

Nolan (1964) performed an experiment very similar to that of 
Hawk and DeRidder (1963). Nolan also studied the arbitrary 
assignment of grades to an essay test but this time with respect to 
the effect on subsequent test performance rather than attitudes. Two 
intact classes in undergraduate educational psychology were used. 
The class with 99 students met Monday, Wednesday, and Friday at 
9:00 A.M., while the class with 126 students met the same days 
at 10:00 A.M. In the total body of students, there were 68 males 
and 157 females. The course carried six credits' weight rather 
than the usual three. Each class day, one hour was spent in a 
large lecture class where two instructors taught as a team; the 
second hour was spent in small discussion sections where graduate 
assistants were in control. 

During the third week of the semester, on a Monday, it was 
announced that an essay test would be given that Wednesday. A 
three-question essay test was then given on Wednesday. The papers 
were graded outside of class by the instructors and returned at the 
next meeting (Friday). Nolan (1964, p. 36) states, "Those students 
for whom there was no grade-point-average, were assigned grades of B 
and D equally so that it would appear to the students that the full 
range of grades had been used." No discussion of results at Friday's 
class was permitted, and the papers were returned to the instructors 
at the end of class. On the following Monday, an 18-item, 5-choice 
criterion test was given; it was found that the KR-21 coefficient 
was .81. This objective test covered the next assignment and was 
not directly related to the content of the essay test. 
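
For readers unfamiliar with the KR-21 coefficient, the following is
a minimal sketch of Kuder-Richardson formula 21, which estimates
reliability from the number of items k, the mean M, and the variance
of total scores alone: (k / (k - 1)) * (1 - M(k - M) / (k * variance)).
The scores are invented, the function name is hypothetical, and the
use of the population variance (dividing by n) is an assumption;
the source does not say which variance was used.

```python
def kr21(total_scores, n_items):
    """Kuder-Richardson formula 21 from total scores on n_items
    right-or-wrong items."""
    n = len(total_scores)
    m = sum(total_scores) / n
    # Population variance of total scores (assumption: divide by n).
    variance = sum((s - m) ** 2 for s in total_scores) / n
    return (n_items / (n_items - 1)) * (
        1 - m * (n_items - m) / (n_items * variance)
    )


# Invented total scores on an 18-item test, for illustration only.
scores = [6, 9, 12, 15, 18]
reliability = kr21(scores, 18)
```

Because the formula treats all items as equally difficult, it is
generally regarded as a conservative shortcut relative to formula 20.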

One control variable was cumulative grade-point average (GPA). 
A second control variable was to have been the score on the Work 
Persistence Attitude Scale (WPAS). The 20-item, 5-point-continuum 
type WPAS was given at the first class meeting. Ten items dealt 
directly with work persistence attitudes, while the other ten items 
were distractors chosen from a standard personality test. However, 
it was found that the WPAS possessed almost no discrimination power, 
and the internal consistency coefficient was only .32 (computed by 
analysis of variance). Hence, WPAS was discarded as a second 
control variable. 

Three separate analyses of variance were performed: all subjects 
as a whole, females only, and males only. It seems to the reviewer 
that, if separate analyses by sex can be performed, then it would 
have been better to include sex as a second control variable in the 
whole-group analysis; then measures of interactions with sex could 
have been obtained. For all subjects, as one might expect, the 
control variable of GPA was highly significant (P < .005); all other 
main effects and interactions were highly insignificant (P > .25). 
For females alone, GPA differed significantly (.005 < P < .01), while 
treatments and treatments by GPA were insignificant (P > .25). 
Strangely enough, for males alone, the control factor of GPA was 
highly insignificant (P > .25), while treatments were less 
insignificant (.10 < P < .25)! Again, however, treatments by GPA were 
insignificant (P > .25). 

The investigator suggests several limitations of his study: the 
failure of WPAS to function as a control variable, the inability to 
administer treatments more than once because of ethical 
considerations, and the possibility that college sophomores are so 
ingrained with school procedures that a single essay quiz does not 
make much difference one way or the other. Nolan (1964, p. 61) lists 
possibilities for further research: "The development of a sensitive 
and sophisticated instrument for the measurement of student 
attitudes toward the grades they receive is needed. . . . Studies 
are needed at various age and experience levels in order to 
determine where the grades assigned to students' work are most 
influential." The latter recommendation has strong relevance for the 
reviewer's dissertation experiment. 

REVIEW TOPIC THREE: TEST CORRECTION 
WITH RESPECT TO INFORMAL ACHIEVEMENT TESTING 
AS A LEARNING DEVICE 

The detailed rationale for treating test correction method as 
an aspect of informal achievement testing as a learning device has 
already been presented in the general introduction. No more need 
be said here. Test correction will be reviewed only very briefly, 
since it is not of concern to the reviewer for his dissertation. 
The only other topic in this review that will be done in detail is 
the tenth one: student attitudes toward informal achievement tests. 
Each of the briefer review topics (of which "test correction" is 
one) will still be separated into nonresearch references and 
experimental studies; in turn, experimental studies will be broken 
down only by broad results: positive, negative, or no difference. 

Nonresearch References: Potential methods of having students 
correct tests have been described by various writers: Jeep (1933), 
Smeltzer (1933a), and Lee and Symonds (1934). With respect to 
objective tests, Davis (1943, p. 530) says, "The burden of 
correcting short tests and written exercises may be shared by pupils 
in scoring the papers." Krause (1966) presents a similar argument. 

Experimental Studies: Gates (1921) has found favorable results 
for the student-corrected mode. Cocks (1929) found that tenth-, 
eleventh-, and twelfth-grade boys who corrected their own test 
papers in physics did much better than the groups in which the 
usual teacher-correction procedure was used; girls were not used in 
the experiment. Cocks also found that, among the members of the 
student-correction groups, the younger, less intelligent ones 
benefited most. He obtained similar results in content areas other 
than physics. Buckner (1931) found a slight difference in favor 
of the student-corrected-test group over the traditional 
teacher-corrected-test group in a high school foreign language 
course. Curtis and Woods (1929) found the student-corrected mode was 
superior to the teacher-corrected mode in seventh-grade general 
science, eighth-grade general science, ninth-grade biology, 
tenth-grade biology, and eleventh-grade chemistry. Curtis and 
Darling (1932) replicated the 1929 experiment and found the same 
results. Finally, Curtis (1944) found similar results in high school 
science; unfortunately his results were confounded with feedback 
mode. 



REVIEW TOPIC FOUR: TEST RESULT 
FEEDBACK AS RELATED TO TESTING AS A LEARNING DEVICE 



This review is confined to the use of traditional teacher-made 
tests used in realistic classroom situations; the reviewer will 
omit the voluminous literature from contrived laboratory studies on 
immediate and delayed reinforcement in trivial tasks with mechanical 
apparatuses. Further, the investigations on programed instruction 
dealing with feedback response frame schedules will not be con- 
sidered here. Such peripheral topics have been adequately reviewed 
elsewhere. Finally, it might be said that feedback and correction 
are very closely related and perhaps might be treated more 
appropriately as a single topic; however, for purposes of efficiency 
and analysis, the two topics have been kept separate in this review. 

Nonresearch References: Opdyke (1927, p. 36) says, "If it 
[that is, the frequent teacher-made test] can be kept short enough 
to permit children to finish and then to discuss it with the teacher 
in the same period it will have an immediacy of impression and 
effect that will prove invaluable." Symonds (1927, p. 533) claims, 
"One advantage of the new-type test is that it may be immediately 
scored and discussed, thereby making the most of the discussion when 
interest in the test is running high. . . ." Weber (1929, p. 537) 
maintains, "Whether given at the end of the term or at the close of 
a certain unit of work, their results should be reviewed with the 
entire group. . . ." 

Kneeland and Bernard (1953, p. 499) claim that teacher-made, 
objective tests are often overlooked as to their value "To stimulate 
more and better discussion." Koester (1957) described a loose, 
noncontrol-group tryout "experiment." He made use of small 
discussion groups for talking about test results as compared to the 
traditional intact-class, posttest discussion sessions led by the 
teacher; he claims that the latter is not as effective as the 
former. In relation to a loose, noncontrol-group tryout of 
true-false tests, Flook (1959, p. 262) claimed, "In short, use of 
the test had made a valuable contribution to the course, in 
particular by improving the quality of discussion [afterwards]." 

Tyler (1959, p. 15) said, "Another policy which can increase 
the positive values of testing is to use similar tests periodically 
throughout the instructional program and to review with the 
students their performance on each test. . . . This practice also 
reduces the emotional tension surrounding testing. Testing becomes 
a natural part of the total learning process rather than an 
infrequent and traumatic experience." Anderson (1960, p. 51), in 
discussing the use to which classroom tests are often put, said, 
"Reinforcement . . . has come in for little specific attention. . . . 
the periods of time which elapse between the student's response and 
some of the meager reinforcements he does receive are frequently so 
long that most of the effect is destroyed." Coladarci (1964, p. 
258) claimed, "Testing, as a part of the evaluation of the 
behaviour [sic] in relation to the goals, must parallel the 
educative process in order to provide feedback on the progress being 
made." Finally, Dyer (1967) provided a learning theory model for 
instructional feedback in terms of instructor, student, student 
environment, feedback, and points in time. 

Experimental Studies: One of the earliest realistic feedback 
learning experiments was that of Book and Norvell (1922). They found 
that feedback in the form of immediate numerical scores (as opposed 
to letter grades) in simple arithmetic tasks was superior to no 
such feedback (simple practice without knowledge of results). Brown 
(1932) studied feedback in fifth-grade and seventh-grade children 
in a large city school system in connection with previously learned 
arithmetic skills. In each grade, one group employed immediate 
feedback in the form of a bar graph, while the other group got 
delayed feedback after teacher correction. In both grades the 
bar-graph method was most effective. Ross (1933) studied feedback 
from the standpoint of how much knowledge of results is given each 
student. Ross found no difference among four degrees of detailedness 
in a tests and measurements class at the college level. 

Plowman and Stroud (1942) found that subjects who got test 
result feedback in the form of written teacher solutions of wrong 
problems reduced their errors by half a week later. Krueger (1947) 
performed an experiment comparing students' honesty in reporting 
grading errors under different conditions. Angell (1949) dealt with 
immediate knowledge of results in college chemistry at the 
freshman level by means of a punchboard. The punchboard group was 
superior to the usual machine-scored, delayed feedback group. Jones 
and Sawyer (1949) again found superior results for immediate 
feedback by means of the punchboard as compared to traditional 
delayed feedback in an undergraduate freshman course. 

In Armed Services classroom training, both Stone (1955) and 
Bryan and Rigney (1956) found that the more complete the amount of 
feedback (for example, discussion of errors versus simple return 
of numerical scores), the better the achievement. Page (1958b, 
p. 173) claims, "Each year teachers spend millions of hours marking 
and writing comments upon papers being returned to students, 
apparently in the belief that their words will produce some result, 
in student performance, superior to that obtained without such 
words. Yet on this point solid experimental evidence, obtained 
under genuine classroom conditions, has been conspicuously absent." 
He performed a tightly controlled experiment with 74 intact 
classrooms in seventh through twelfth grades in several content 
areas. Three treatments were used: free comments, specified 
comments, and no comments. A monotonically decreasing performance 
level was found for the latter order of treatments. 

Sturges and Crawford (1963) studied immediate versus delayed 
feedback with realistic, factual material. Sassenrath and Garverick 
(1965) found that the teacher-led discussion method of feedback was 
superior to looking up wrong answers in the textbook, checking over 
answers from correct ones on the board, and no feedback. The 
subjects were students in undergraduate introductory psychology. 
Paige studied feedback in eighth-grade mathematics students. E 
received immediate feedback in the form of special carbon copies of 
their tests which they kept after turning in their original tests; 
C had to wait as usual until the next day. E exceeded C in 
performance. 

Daniel (1968) studied feedback in undergraduate educational 
psychology classes. E received immediate knowledge of results on a 
teacher-made test; E had to look up correct answers to their 
mistakes. C received knowledge of results a day later. Both E and C 
received discussion of results. Strangely enough, the delayed 
feedback group excelled over the immediate feedback group. Daniel 
and Witchel (196?) found similar results for one-week delayed 
feedback over immediate feedback in college students on 
teacher-made tests. 

REVIEW TOPIC FIVE: PRETESTING AS AN 
ASPECT OF TESTING AS A LEARNING DEVICE 

Considered positively, pretesting as an instructional device 
might be thought to involve benefits to both the teacher and the 
student in terms of diagnosis and, especially for the student, of 
structuring in his mind subsequent subject matter. Considered 
negatively, pretesting as a methodological device can taint sub- 
sequent measurements by the phenomenon of sensitization. 

Nonresearch References: Breslick (1921, p. 277) says, "The 
old type [of examination, namely, the essay test] is worth more in 
diagnosing pupil difficulties. The extent of the value of the new 
type [that is, the objective test] in diagnosis is yet to be fully 
demonstrated in history." Spencer (1923) talks about informal 
achievement tests used as diagnostic devices in high school algebra. 
Horn and Ashbaugh (1926) recommend the use of the test-study method 
in teaching elementary school spelling. Cody (1929), Weber (1929), 
McGinnis (1929), Jones (1929), and Burr (1929) discuss the diagnostic 
values of testing. Breed (1930) recommends the use of pretesting 
in teaching elementary school spelling. 

Horn (1933) is also in favor of the test-study method of teaching 
elementary school spelling. Hutchinson (1933, p. ^36) said that 
tests ". . . should help the student organize his knowledge . . . 
[and] should give aid to both student and teacher in diagnosing the 
weaknesses . . . in the student's knowledge." He was one of the 
few writers to see the structuring benefits of tests. In relation 
to diagnosis, Smeltzer (1933a, p. 527) said, "Much classroom testing 
is of little value from a teaching or learning standpoint because 
no further analysis is made of the test results." Davis (I9A3, p. 
528) claimed, "A desirable use of a test is to survey at the 
beginning of a subject the pupil's previous background and the 
extent to which any abilities have already been developed." 
Lockhart (I9A8) suggests using pretests to find out how much students 
already know. Kneeland and Bernard (1953) said much the same 
thing. 

Experimental Studies: Kingsley (1923) found that the pretest 
method was superior to the study-test method of teaching spelling. 
Kilzer (1926) made the same comparison of the pretest method with 
the study-test method and likewise found the pretest method superior. 
Watts (1928) was another to study the test-study method of teaching 
spelling. Jersild (1929) performed three experiments on 
pretesting versus no pretesting. With multiple-choice tests and 
essay tests, positive results were obtained for pretesting, while 
pretesting with true-false tests gave negative results. Kirkpatrick 
studied the effect of pretesting in high school physics. The 
pretested groups were superior to the nonpretested groups. Keys 
(193^b), dealing with upperclassmen and graduate students in 
educational psychology, found that for items on which subjects 
were pretested at the start of a unit of instruction, rates of 
achievement were higher than for nonpretested items. 

Gates (1939) studied the pretest method of teaching spelling 
with the usual techniques. In spelling, for third through eighth 
grades, the test-study method was superior to the study-test 
method. Luce (1939) studied seventh, eighth, and ninth grades 
on a geography passage on the basis of the methods of test-study, 
study-test, and just study. The study-test group appeared to 
do slightly better on all posttests given to the three groups. 
Tiedeman (19^8) found that fifth-grade subjects pretested 
on geography passages were slightly superior in achievement to 
nonpretested groups. 

Finally, several investigators have considered the methodological 
aspects of pretesting: Solomon (19^9), Hovland, Janis and 
Kelley (1953), Piers (1955), Lazarsfeld (1957), Campbell (1957), 
Lana (1959a, 1959b), Lana and King (1960), Entwisle (1961), 
Campbell (1963), Edling (1963), and Rayder and Neidt (I96M. 

REVIEW TOPIC SIX: RETESTING AS AN 
ASPECT OF TESTING AS A LEARNING DEVICE 

This section is directed toward those studies that have made 
attempts to compare performance of subjects who are retested several 
times with informal achievement tests, with those who are not, on 
the criterion of delayed achievement. It should be noted that many 
studies have been done on recall and retention, but most have dealt 
with trivial materials in unnatural, laboratory-like situations; 
this review covers only the realistic learning situation experiments 
that approximate the classroom setup. 

Nonresearch References: Spencer (19^0, p. 1^) provides an 
excellent description of the matter at hand: 

There are exceptions to the assumption that all 
retention curves show a drop after the learning 
performance is ended. Ballard [1913], by an 
experiment in which pupils memorized a poem, 
found by retests day after day, that the scores 
went up during a period of five days following 
immediate testing. Ballard designated this 
process, which is opposite to forgetting, as 
reminiscence. It follows that for there to 
be an increase in the amount of retention at 
delayed recall, the material must have been 
incompletely learned originally. If recall 
at the close of learning has been complete 
no later recall can be greater; hence there 
can be no reminiscence. 

Woodworth [1938] terms the idea that a 
forgetting curve can rise is [sic] absurd and 
that it is impossible for one to retain 
more than was learned originally unless 
some other process enters to produce added 
learning. Woodworth advances the possibility 
that reminiscence is due to the involuntary 
or voluntary reviewing of the material by 
the subjects. Ballard admits this possibility 
but does not believe that the whole 
effect of reminiscence could be so 
explained. [underlining inserted by 
reviewer] 

Experimental Studies: Yoakam (1922) found that testing retards 
the curve of forgetting more than just rereading written passages 
on which the tests are based. Jones (1923) was one of the first 
to study the reminiscence phenomenon to any great depth with 
realistic learning materials at the college level. In general, 
he found that retesting (in many different schedules) impeded 
the curve of forgetting as compared to nonretesting conditions. 
Spitzer (1938, 1939) found that, for geography passages with 
sixth-graders, the closer retests were to the initial posttest, the 
more the curve of forgetting was retarded. Spencer (19^0) 
replicated Spitzer's experiment, this time presenting the 
geography orally. Similar results were obtained. 

Sones and Stroud (19^0) studied retesting versus simple 
rereading in different time schedules with a passage on geography 
for seventh graders. When retesting was used relatively close to 
the first posttest, it was superior, but rereading exceeded 
retesting as one got further away from the initial posttest. Davis 
and Rood (19^7) found the reminiscence phenomenon in arithmetic 
for three testings with the same test. Little (19^0) studied 
reminiscence in undergraduate biology students; the phenomenon 
was again present in the case of retesting versus nonretesting. 
He also discusses methodological issues with respect to the 
calculation of reminiscence scores. Finally, Celinski (1968) studied 
retesting in graduate level electronics courses in two different 
universities; unfortunately, no control groups were used. Short, 
announced, repetitive tests were given quite frequently throughout 
the course. Each student moved at his own rate, but to progress 
further, he had to obtain a perfect test score; if he did not, 
he kept taking the test over until this occurred. Each subject 
evidenced increasingly better performance as the semester progressed. 

REVIEW TOPIC SEVEN: TEST EXPECTATION 
AS AN ASPECT OF TESTING AS A LEARNING DEVICE 

The motivational aspects (and hence, the additional learning 
benefits) of test expectation (perhaps in the form of prior announce- 
ment of an impending test) have already been discussed in detail in 
the introduction. This review section gives a brief overview of 
studies in this field. 

Nonresearch References: Gable (1936, p. 1), in connection with 
test expectation, said, ". . . educators seem to be agreed that 
pupils tend to accomplish more when confronted with the realization 
that a day of reckoning is at hand when they are expected to give an 
account of their knowledge. Such a situation contains dynamic or 
motivating properties which aid the crucial aim in teaching-- 
/at ion. " 

Experimental Studies: Jones (1923), as part of his extensive 
series of retesting experiments at the college level mentioned 
above, also did significant work in this area. At the beginning of 
the hour-long lecture period, half of each class used was given a 
slip of paper that notified them of a five-minute quiz at the end 
of the period on the material of that day's lecture; the other half 
of each class received a "dummy" library notice on a similar slip 
of paper. The results for the pooled expected and unexpected groups 
were almost identical. Schutte (1925) used "normal school" students 
in an introductory education course to measure the effect of 
announcement of an impending test. The experiment was conducted 
twice (once for each of two academic years). Two intact classes a 
year were used: one class expected a final examination, while the 
other did not. The results for the two separate years were pooled 
with "methods" still being distinguishable. The group expecting the 
examination performed better than the unannounced group. 

Pease (1927, 1930) studied the effects of "cramming" and 
expectation versus nonannouncement. Both high school and college 
subjects were used. On the day the test was to be given, E was 
told of the impending test and was instructed to "cram" for it 
in the time set aside for this purpose; C was given the test 
immediately without announcement or time for equivalent "cramming." E 
did much better than C on both immediate and delayed retention. 
White (1932) compared a group that was told that a final examination 
would be given in the course and what types of material would 
be on it, with a group that was not told to expect a final 
examination. The group expecting the examination was superior in 
performance on the final test. 

Corey (1935) was one of the first to do work in this field with 
realistic learning materials and environment. Gable (1936, p. 5) 
said Corey ". . . compared correlations of Army Alpha scores of 10^ 
students with test results obtained from surprise quizzes, on the 
one hand, and on the other hand with results obtained from a final 
examination announced long in advance. His assumption is that 
achievement is motivated much more adequately in a final examination 
than in a surprise quiz." Gable (1936) studied subjects in 
ninth-grade biology. She compared three groups: a "pop"-quiz one, a 
preannounced one, and a nontest control one. She concluded that 
a mixture of announced and unannounced quizzes is the most 
effective procedure. 

REVIEW TOPIC EIGHT: TEST EXEMPTION 
AS AN ASPECT OF TESTING AS A LEARNING DEVICE 

The reader will recall from the introduction that test 
exemption has potential motivating properties in at least two ways: 
exemption from course work by superior test performance, or 
exemption from tests themselves by superior course work. 

Nonresearch References: Odell (1928, pp. 51-52) seriously 
questions the motivating properties of the process of exemption from 
the final test by classroom performance: ". . . it comes to be 
looked upon by pupils as more or less of a disgrace to have to 
take examinations. . . . The whole exemption system tends too much 
to make the examination a penalty and a disciplinary device rather 
than an integral and educative part of the instructional process." 
Cole (1929, p. 120) said: 

It was a rule in the Seattle schools for some years 
to excuse all pupils of advanced standing from 
taking examinations. In other words, we made 
examinations a penalty, and it was considered 
somewhat of a disgrace to take them. As one 
teacher put it, 'The only pupils in this school 
who are taking examinations are those who will 
profit the least by taking them.' 

Nickerson (1929, p. 253) asked: 

Should exemptions be made? I grew up in a 
system where it was the rule to be excused 
from examinations if you made a certain monthly 
grade. Everyone strove to attain that 
average, and made much better grades than they 
would otherwise have done. Oh, what a joy 
it was to get a few days' vacation! I think 
the teachers really enjoyed not having those 
extra papers too. Examination under that 
system became a penalty rather than a 
'learning exercise.' All would have been 
well had I not gone to college and had to 
take examinations. I had never had to 
take examinations, and just the thought 
of them appalled me, and still does. 

Webb (1929, p. 282) also argues against exemptions from tests: 

The training which students get in 
correct written expression through tests or 
examinations is valuable for all classes 
of students, therefore not any should be 
exempt from examinations. Graduates of 
high intelligence are ofttimes handicapped 
in making good in important positions 
because they failed to acquire the habit 
of producing elegant and accurate oral 
and written expression. Some of them 
were exempted from examinations. 
[underlining inserted by reviewer] 

While the latter part of Webb's passage may be somewhat 
unscientific, his initial point is well taken, as was Nickerson's point 
that examinations are an integral part of our lives and to exempt 
people denies them necessary practice in examination-taking skills. 
On the other hand, the motivating properties of exemption 
cannot be dismissed lightly. 

Finally, in reviewing the state of research in this area in 
his time, Davis (19^3, p. 533) claimed, "Investigations dealing 
with the effect of exemption from the final comprehensive 
examination, the extent to which it provides information additional to 
that the teacher already has by the time it is given, have not 
yielded answers sufficiently conclusive for generalization." 

Experimental Studies: Morley (1926) noted that, while superior 
students were not affected one way or the other by the exemption 
procedure, the mediocre students gained more than they would have 
otherwise. Engelhart (1931) described an experiment dealing with 
exemption from the final test by course performance. He 
concluded that this type of exemption appeared to raise the 
performance of otherwise average students. Gould (1932) performed 
a survey of 125 secondary schools in 48 states. He reported 
(p. 1^5), "No exemptions are permitted in forty-seven of the 
sixty-one schools requiring final examinations. Pupils are 
exempted for many different reasons in the remaining fourteen 
schools." 

Meltzer (1933) performed an experiment dealing with 
exemption from a portion of course work every week on the basis of 
weekly tests. The exemption procedure was superior to the 
traditional nonexemption one. Smeltzer (li33i and 193^b) performed 
a study in college chemistry very similar to the study of Meltzer 
(1933). Smeltzer found that the extremes of the ability range 
were affected very favorably by the weekly exemption procedure 
in relation to the forced-attendance group. Remmers (1933) made 
a study of undergraduate engineering students. E was allowed 
exemption from the final examination on the basis of superior 
course performance, while C had to take the final examination 
regardless of previous achievement. E was superior to C on 
immediate retention but not on delayed retention. Finally, Dole 
(1951) provided evidence on the other procedure, exemption 
from the course by test performance. In his study at the 
college level, he concluded that the procedure was very effective. 

REVIEW TOPIC NINE: STUDENT PREPARATION FOR TESTS 
AS AN ASPECT OF TESTING AS A LEARNING DEVICE 

A few theorists and investigators have been interested in the 
ways that subjects prepare outside of class for informal 
achievement tests. Not much has been done in this area, mainly because 
it is difficult to control extra-class variables. However, most 
of the nonresearch references on motivation arising from tests 
refer directly or implicitly to this external preparation factor. 
Thus, any vague reference to the simple motivating power of a test 
(without specifying just how such motivation is brought about) 
usually implies that the subject has been urged onward outside 
of class to greater preparation for the test. Such vague test 
motivation references are considered in this section of the review. 

Nonresearch References: Weber (1929, p. 62) says the informal 
achievement test ". . . teaches the student to express his 
knowledge accurately and concisely. As a preparation for this expression, 
the examination, if effective, requires a careful study and review 
of the course pursued." Pyrtle (1929, p. 119) claimed, "Tests when 
properly given are a stimulus or challenge to a student to more 
effective or more thorough work." Weeks (1929, p. 281) asserted 
that informal achievement tests ". . . act as a motivator. The 
knowledge of a judgment day seems to keep some folk on the straight 
and narrow way. Pupils who know beforehand that they are going to 
be held accountable for a given unit of subject matter will study 
more diligently, all other things being equal." Similar ideas have 
been expressed by Colvin (1913), Symonds (1927), Pearson (1929), 
Cole (1929), Verables (1929), and Krause (1966). 

In connection with the general problem of motivation from 
tests, Ruch (1929, p. 10) said, "It is unfortunate that we have 
so little direct information as to the motivating effect of 
examinations. That examinations do have this value has been tacitly 
agreed but never proved." Finally, Tyler (1959, p. 10) claimed, 
"Well-motivated students have commonly put extra time and effort 
into study when they thought they were soon to be tested." 

Experimental Studies: Douglass and Tallmadge (193^) made an 
intensive effort to discover how subjects prepare for examinations; 
so also did Meyer (193^, 1935). In all cases it was found that 
the announcement of an objective test produced different study 
methods than the announcement of an essay test. Class (1935) also 
performed one of the few major studies in the area of type of 
preparation used for tests. Briefly, he found that most subjects 
used very different study habits for true-false tests as compared 
to essay tests, when such tests were announced in advance. 
Further, he found that subjects performed best on both true-false 
tests and essay tests as compared to other types (completion, 
multiple-choice, and so on) when type of test was not announced 
in advance. Finally, Vallance (19^7), studying senior high school 
students, again found a strong difference in study methods used 
for essay tests as compared to objective tests. 

REVIEW TOPIC TEN: STUDENT 
ATTITUDES TOWARD INFORMAL ACHIEVEMENT TESTS 

This section includes miscellaneous attitudes of students 
toward informal achievement tests. Although emphasis is usually on 
measures of performance and achievement in connection with studying 
testing as a learning device, measures of attitude are also very 
important. In connection with attitudes, test anxiety will also be 
considered in this review section. 

Deputy (1929), whose study of frequency of testing has already 
been described above, also investigated student attitudes toward 
informal achievement tests; no doubt he was one of the first in this 
respect. He made a survey of attitudes near the end of the semester 
after the procedures had been changed (the reader is advised to look 
over the description of Deputy's experiment in the first review 
topic). The respondents were told to remain anonymous. Deputy 
gives only rough percentage statistics but no tests of 
significance: in L|, 86^ preferred daily written work; in E^, 85^; and 
in ^6<5. Deputy (1929, p. 333) comments: "at three different 
times soon after Section 1 had been changed from an experimental to 
a control section, the students as a class asked to continue the 
daily written class work, their score in the mid-semester examination 
was not so gratifying as that of Section 1. . . . It is suggestive 
to know that the extent of the unfavorable attitude of Section 2 
toward the written exercises is due to the fact that their written 
work came during the second half of the semester, after a half 
semester of only oral recitation work." 

Turney (1931), who also did a frequency of testing experiment 
described earlier, gave a questionnaire of yes-no type to the 
experimental group (the one that underwent frequent testing) at the last 
class meeting. Out of subjects who took the questionnaire, 
Turney claims most were favorable toward short, frequent tests. He 
provides no exact data or statistical tests and did not give the 
questionnaire to the control group. 

Kitch (1932), the major part of whose experiment was also 
described earlier in the frequency of testing section, also studied 
student attitudes. A questionnaire of yes-no type was given to 
the students at the end of the experiment. Although no tests of 
significance were made on the frequency data, chi-square analyses 
would be easy to perform on the fifteen questions, since 
nonoccurrence was allowed for. When the possible benefits to be gotten 
from practice tests were ranked in order of highest number of 
positive responses, calling attention to points not noticed was first, 
needing to study sooner and hence more often was second, aiding in 
learning key textbook facts was third, and hinting at what the 
teacher considered important was fourth. An open-ended question was 
also put on the questionnaire. 

Lee and Symonds (193^, p. 17^) provided evidence on two student 
test attitudes studies: 

Students in science prefer objective tests to the 
essay type according to the data presented by 
Hurd [1929] and Diamond [1933]. Hurd found 
that physics students on the college level 
preferred objective tests, largely because 
such tests covered more ground. Diamond 
studied the preferences of high school 
students finding that they also preferred 
objective tests. He also found that pupils 
preferred tests made out by other pupils 
and tests where graphic records of results 
were kept. 

Keys (193^»^), whose frequency of testing study was described 
in detail earlier, also dealt with student attitudes. At both the 
start and end of the term, he gave the same 30-item, yes-no attitude 
questionnaire to both groups of his study. Using simple z tests for 
testing the difference from the start to the end of the term, Keys 
found two questions to be particularly interesting: a significant 
increase in the number of students favoring tests given every 
second, third, or fourth class meeting (two-tailed P < .001) and a 
significant decrease in the number of students favoring tests given 
only three or four times a semester (two-tailed P < .001). 
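
The kind of start-versus-end comparison reported by Keys can be 
sketched as a two-proportion z test. The sketch below is only an 
illustration under stated assumptions: the function name and the 
counts are hypothetical (the review reports no raw frequencies), and 
the two administrations are treated as independent samples, although 
Keys questioned the same students twice, for which a paired procedure 
might be more appropriate. 

```python
import math

def two_proportion_z(yes_start, n_start, yes_end, n_end):
    """Pooled two-proportion z test (hypothetical helper, not Keys's own procedure).

    Returns (z, two_tailed_p) for the change in the proportion of "yes"
    answers from the start to the end of the term. Assumes independent
    samples and a pooled proportion strictly between 0 and 1.
    """
    p_start = yes_start / n_start
    p_end = yes_end / n_end
    pooled = (yes_start + yes_end) / (n_start + n_end)
    # Standard error of the difference under the null hypothesis of no change
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_start + 1 / n_end))
    z = (p_end - p_start) / se
    # Two-tailed probability from the standard normal distribution
    p_two_tail = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tail

# Hypothetical counts: 30 of 100 favor frequent tests at the start, 60 of 100 at the end
z, p = two_proportion_z(30, 100, 60, 100)
print(f"z = {z:.2f}, two-tailed P = {p:.5f}")
```

A positive z here corresponds to an increase in favorable answers over 
the term, a negative z to a decrease. 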

Noll (1939, p. 356), whose frequency of testing experiment was 
described above, provided rough percentage statistics but no tests 
of significance: "These replies indicate . . . that about half said 
they would have enjoyed it [that is, the course] more if there had 
been occasional written tests, and more than three-fourths stated 
that they thought they would have learned more if there had been 
such tests." 

Bender and Davis (19^9) gave a questionnaire about teacher-made 
tests to 10^0 subjects in ^1 secondary schools (public, private, and 
parochial); the sample of schools was a proportionately stratified 
one. Apparently, only tenth through twelfth grades were used. No 
tests of significance were provided; only loose percentage 
statistics and informal trends were given. In summarizing their major 
results, the investigators said (p. 65): 

A highly competitive situation exists for 
grades in secondary schools; all students 
wish to be judged fairly and by uniform 
standards; and students desire to enter a 
test with no advantage or disadvantage to 
themselves in comparison with other 
members of the class. 

Students in general consider any particular 
order of materials in examinations to be 
of little consequence although they show 
a decided preference for questions that 
stress problem-solving ability. Those 
who are unprepared for a test show a 
preference for multiple choice and 
true-false items. When they are well-prepared, 
their preference is for the essay and 
completion items. "Cramming" is 
considered worthwhile by a majority for 
essay, completion, and problem types of 
tests although many consider "cramming" 
worthwhile for all types of tests. 

A majority of students prefer difficult tests 
with ample advance notice (2 to 3 days) 
to easier tests without previous 
notification. They also wish to know what a 
test will cover and the kind of items 
that will be used. Almost all students 
desire that the papers be returned promptly 
with grades and corrections on them. Most 
students welcome tests as often as once 
a week. Almost all students worry more 
or less about all examinations. A few 
worry to such an extent that they are 
unable to do their best work on a test. 

It is evident that most students work for 
grades and that they desire to have all of 
their papers scored and to have all of them 
count as credit toward their final grade. The 
returns indicate that a majority have an 
outlook on tests that is sufficiently well 
balanced and wholesome to serve as a suitable 
basis for the functioning of psychological 
principles required for the effective use of 
tests as learning instruments. 

DeLong (1955) conducted an attitude study at the elementary 
school level. The survey was very loosely conducted and very 
narrow in scope; all that can be concluded from it are hints for 
further research of a "tight" type. All the elementary teachers of 
three school systems were given a 10-question essay (that is, 
open-end) survey as to how they felt their students reacted to tests as 
compared to nontesting situations. Also, the investigator had some 
of his university students go into the same elementary schools on both 
test and nontest days to observe the subjects' behavior. Finally, 
over 200 longitudinal case studies from the university's elementary 
laboratory school were examined for the effects of test-taking. No 
firm conclusions could be drawn from the whole "experiment" other 
than that children act differently under a test situation than they 
do under a nontest situation. The investigator admits that much more 
research is needed on the emotionality of test situations as compared 
to nontest situations. The reader should note that DeLong's study 
is a comparative approach rather than Sarason's "isolated" method; 
the latter takes whatever the child admits on the TASC to be his 
degree of test anxiety; the former is perhaps more meaningful in its 
relativistic approach. 

Mudgett (1956), whose frequency of testing experiment was 
described earlier, prepared two questionnaires: one for the students 
and one for the instructors. All daily, weekly, and monthly test 
groups were given the questionnaires. However, too many objections 
were put forth by the students to their questionnaire, and it was 
consequently omitted from the study. Thus, the attitude analysis 
consisted only of the instructors' questionnaire. In the monthly 
test groups, the instructors noted that the quality of questions 
asked in class and of subsequent discussion was poor; the instructors 
attributed this situation to poor motivation because of the 
experimental treatment of monthly tests. Similar comments were made by 
the instructors of the weekly test groups. On the other hand, the 
instructors considered the daily test groups to exhibit superior 
discussion quality and better motivation; also, even though many 
subjects objected to daily quizzes at the start of the study, as 
time progressed more and more subjects saw the advantages in the 
daily quiz program. Further, instructors felt that all schedules 
of testing in general, and the daily testing program in particular, 
aided them with scheduling and preparing lessons and moving along 
with relatively uniform progress. 

Koester (1957), in what was not a formal study as such, reported 
on the reactions of fifty students in two graduate-level university 
classes to frequent testing. He claims (p. 207) that the students' 
opinions have been ". . . very favorable in terms of interest, 
motivation, and a feeling of having clarified basic principles." 

Selakovich (1962, p. 180) reported: "The students in the 
experimental section were asked their opinion on the 'pop quizzes' 
and there was a near unanimity of opinion favorable to the technique. 
. . . Most of the students who took 'pop quizzes' felt it helped them 
learn the basic information required in the course even though the 
results of the experiment indicate this was not true." 

Gaier (1962, p. 561) claims, "A consideration of the methods and 
techniques employed by students in their approach and preparation for 
a test situation has not, as yet, brought into being a body of 
systematic research. In spite of the perfection of testing tools, test 
situations are frequently perceived by both students and teachers 
as forms of punishment--mild or otherwise, depending on the 
difficulty of the testing instrument--rather than a learning experience." 
In Gaier's study, seven intact classes of educational psychology at 
the university level were used. The subjects were instructed on the 
response form (p. 561), "Assume that you will receive a letter grade 
of ["A" or "D"] on the test you are to take. List the specific 
activities, either on your part or on the part of the instructor, 
that you feel were influential or responsible in making this grade." 
This response form was given out in the same class session in which 
the first quiz of the semester was to be given; the subjects had to 
complete and return the forms before they were given the actual quiz. 
The assumed grades of "A" or "D" were distributed as follows: 90 women 
and ^6 men received the "A" forms, while 96 women and men received the 
"D" forms. 

All responses were categorized according to contents in phrases 
or sentences. For the "A"-grade response-form subjects, four 
categories were used to classify responses: (a) "success due to self and 
self-activities", (b) "success due to teacher", (c) "success due to 
external factors", and (d) "denial of the possibility of receiving 
an 'A'". For the "D"-grade response-form subjects, four similar 
categories were used to classify responses: (a) "failure due to self", 
(b) "failure due to teacher", (c) "failure due to internal factors", 
and (d) "not classifiable". In the reviewer's opinion, this dual 
classification scheme is the only useful result of Gaier's study; 
from such a scheme an objective attitude questionnaire could be 
developed that would tap those attitudes about testing that students 
think of most. 

The reviewer does not put much faith in the validity of Gaier's
percentage statistics with respect to the classification of the
students' responses according to the above dual scheme. One difficulty
in this interesting study is that each student could make as many or
as few responses in as many or as few categories as he desired. Thus,
when Gaier goes on to compute what percent of total responses were
attributed to one reason category, the relativity among individuals
(the truly important thing--not the highly variable relativity among
individual responses) is destroyed: everything is distorted, because
each individual may have overemphasized one possible category in
relation to another. The whole situation is analogous to a very
unstructured interview.
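
The distortion can be illustrated with a small sketch. The data below
are invented purely for illustration (they are not Gaier's), and the
function names are hypothetical. One student who writes many responses
in a single category dominates the pooled percent of total responses,
whereas averaging each student's own percentages keeps the individual
as the unit of analysis.

```python
# Hypothetical data: one inner list per student, one entry per response.
# Student 0 wrote ten responses; the other three wrote one or two each.
responses_by_student = [
    ["self"] * 8 + ["teacher"] * 2,
    ["teacher"],
    ["teacher", "external"],
    ["teacher"],
]


def pooled_percent(students, category):
    """Percent of ALL responses in `category` (the response is the unit)."""
    all_responses = [r for s in students for r in s]
    return 100.0 * all_responses.count(category) / len(all_responses)


def mean_student_percent(students, category):
    """Mean of each student's own percent in `category` (the student is the unit)."""
    shares = [100.0 * s.count(category) / len(s) for s in students]
    return sum(shares) / len(shares)


def percent_students_mentioning(students, category):
    """Percent of students who used `category` at least once."""
    return 100.0 * sum(category in s for s in students) / len(students)


for cat in ("self", "teacher"):
    print(
        cat,
        round(pooled_percent(responses_by_student, cat), 1),
        round(mean_student_percent(responses_by_student, cat), 1),
        round(percent_students_mentioning(responses_by_student, cat), 1),
    )
```

In this made-up case "self" accounts for more than half of the pooled
responses even though only one student in four mentioned it at all,
which is the kind of distortion the reviewer objects to.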

Hawk and DeRidder (1963), whose experiment in test grades was
detailed earlier, also studied student attitudes toward informal
tests. At the end of the experiment, a faculty member who had not
taken part in the conduct of the study went around to all sections
and sampled student attitudes. It was found that students in the
pregraded groups worked less hard and were much less motivated in
general than were those students in the groups under the usual
grading procedure.

Curo (1963), whose frequency of testing study was already
discussed, dealt with attitudes toward frequent tests by means of a
questionnaire and individual interviews. The questionnaire remained
anonymous and subjects were deliberately asked to be honest. The
12-item, yes-no questionnaire was given only to the daily test groups.
As might be expected, since all twelve questions were clearly
directed toward the Hawthorne-producing experimental treatment, most
subjects answered in a positive halo-effect sense in favor of daily
quizzes. No statistical tests were run; only rough percents were
given. No reliability or validity evidence was cited for the
questionnaire. Since the interviews were of open-end type, the findings
were too divergent to discuss systematically and concisely here.

Test anxiety will also be considered with "attitudes" in this
review, since its manifestation is usually measured by a paper and
pencil attitude questionnaire. A multitude of studies have been
done on test anxiety. However, with respect to the reviewer's
dissertation experiment (the learning benefits that result from frequent
testing in the elementary school), only Laidlaw (1963) attempted
any measure of test anxiety. He administered the specially developed
"Test Behavior Questionnaire" (TBQ) of Hayes (1960); unfortunately,
this was a relatively unestablished research instrument whose
technical merits are still in doubt. Two equivalent forms are available:
A and B. The alternate-form reliability is .63. Each form contains
33 statements of the agree-disagree type. Laidlaw says (p. 22), "High
scores indicated considerable irrelevant or interfering behavior in
a test situation, while low scores indicated little such behavior."

TBQ-A was given at both the start and finish of the frequency
of testing experiment (described previously). TBQ-A (pretest) was
used as covariate for the criterion of TBQ-A (posttest). Homogeneity
of regression was satisfied. However, highly insignificant
results were obtained: $\bar{X}_{\text{monthly (adj.)}} > \bar{X}_{\text{weekly (adj.)}}$
[probability level illegible].
Further, the weekly and monthly test groups were pooled and then
broken down on paper into high and low ability groups on the basis
of the present course's grades. Only the upper 27% and lower 27%
were considered. Again, TBQ-A (pretest) was used as covariate for
the criterion of TBQ-A (posttest). Homogeneity of regression was
satisfied. Marginally significant results were obtained:
$\bar{X}_{\text{low (adj.)}} > \bar{X}_{\text{high (adj.)}}$
[probability level illegible].
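
For readers unfamiliar with adjusted means, the standard analysis of
covariance adjustment can be sketched as follows. The notation is an
assumption introduced here, not Laidlaw's: $\bar{X}_j$ is the posttest
TBQ-A mean of group $j$, $\bar{C}_j$ is that group's pretest TBQ-A mean,
$\bar{C}$ is the grand pretest mean, and $b_w$ is the pooled
within-group regression slope of posttest on pretest.

```latex
\bar{X}_{j\,(\text{adj.})} = \bar{X}_j - b_w \left( \bar{C}_j - \bar{C} \right)
```

A single pooled slope $b_w$ is appropriate only when the within-group
slopes do not differ, which is what the homogeneity of regression
check is meant to establish.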

Finally, Laidlaw had asked all subjects in all groups to put
down in writing how often they wanted to be tested. By the end of
the experiment in the weekly test group, 86% of the students favored
frequent tests, compared to 52% at the start. By the end of the
experiment in the monthly test group, 66% of the students favored
frequent tests, compared to 64% at the start. No tests of
significance are provided.

In connection with test anxiety, Laidlaw tries to rationalize
his results (p. 45): "The group that was tested each week was tested
four times more frequently than the one tested each month. Each
weekly test accounted for a smaller proportion of the total course
evaluation, so the risk associated with a weekly test was much
smaller. In spite of the difference, the weekly tested group did
not learn to cope with tests with less irrelevant behavior under the
reduced risk condition."

In connection with the insignificant test anxiety results, the
reviewer thinks it also should be noted that TBQ-A was given to both
groups by Laidlaw only at the start of the study and at the end of
the study. Thus it could have been expected that no significant
differences would result, since by the time of the posttest the
students were done with the course, and correspondingly, the fear of
tests should have decreased greatly. On the other hand, if the TBQ
had been given during the testing process, significant differences
(those of "manipulatable process" type, as compared with
"predispositional state" type) might have been more readily obtained.

Nolan (1964) also studied attitudes. The failure of his Work
Persistence Attitude Scale to function the way a reliable and valid
measuring instrument should has already been discussed above in
connection with Nolan's test grading experiment.

REVIEW TOPIC ELEVEN: TEST TYPE AS AN ASPECT
OF TESTING AS A LEARNING DEVICE

This topic might well have been combined with "student
preparation for tests", the ninth topic. However, there are a few
distinct points the reviewer wants to make in connection with test
type beyond the inducing of different study habits.

Nonresearch References: Ballard (1925) claims that true-false
items yield more test learning benefits than do other types
of test items. McCall (1920) provides arguments in favor of the
objective types of tests over the essay test with respect to
didactic value.

Experimental Studies: Remmers and Remmers (1925) compared
true-false items with recall (or completion) items; again,
according to them, the didactic value of true-false items cannot be
denied. Cocks (1929) found superior didactic results for true-false
tests as compared to multiple-choice and completion types
of tests. However, considering pretesting, Jersild (1929) found
true-false items to be didactically inferior to multiple-choice
and essay items.

REVIEW TOPIC TWELVE: "TEST-LIKE EVENTS"
AS AN ASPECT OF TESTING AS A LEARNING DEVICE

As already stated in the introduction, "test-like events" are
included in this review because of their implications for further
research on testing as a learning device. Rothkopf's "test-like
event" procedures provide a tightly controlled environment for
investigating a more basic issue: just what it is about a test that
causes one to learn more than he would have learned had he not taken
the test. Basically, as already explained in detail in the
introduction, "test-like events" are study-guide questions inserted
in reading passages or assignments when given in class (that is, this
approximates a test situation in its evaluative aspects, as compared
to study-guide questions given as outside class homework, where the
study situation is too informal and nontestlike for inclusion here).

Nonresearch References: Cason (1939), acting as theorist rather
than experimenter, recommended the use of in-class, study-guide
worksheets. Langman (1943, p. 534) offered negative criticism:
"Perhaps this passivity in responding to reading materials,
expressed by students in requests for syllabi, outlines, and study
questions [that is, "test-like events"], is in part the result of
our recent teaching methods, which emphasize the provision of
external motivation by means of such study materials. Such
motivation is artificial."

Experimental Studies: In connection with study-guide questions,
Hertzberg, Heilman, and Leuenberger (1932) studied college sophomores
in educational psychology. The experiment matched spring semester
(E) students with winter semester (C) students. Three different
comparisons were made. The difference between E and C was simply
that E was given several representative examinations of the course
from which they could study throughout the duration of the course.
The first comparison was just on the subject matter of the first
unit of work of the semester. An examination of just the first
unit was given to both E and C; E did significantly better than C.
The second major comparison of this study concerned the work of
units two through six of the semester. A special examination on
these five units was administered to both groups. Again, E did
significantly better than C. However, on the third comparison of
this study (the final examination), no significant difference was
found.

The Motion Picture Research Project (1947, p. 256) dealt with
"A procedure which required pupils to participate more actively
during the film showing by answering questions [inserted in the
film] about various points just after they were presented."
Study-guide-question film groups achieved higher results than
nonstudy-guide-question film groups. McKeachie and Hiler (1951, p. 224)
said, "Every subject matter, be it science, literature, or the arts,
is an organized body of knowledge, not a mere array of isolated
facts; hence a knowledge of this subject matter should be an
organized structure within the student's mind. Expectations,
questions, and problems are intrinsic to all such organized structures."
The investigators performed an experiment in elementary psychology
at the university level. Worksheets were used in class to guide the
independent study of subjects; those who had to complete and turn
in the study-guide worksheets were superior to the usual unguided
study group on posttests given at the close of the experiment.
Robinson (1926), Hurd (1931a, b), Greene (1934), Harrington and
Lippert (1934), and Anderson (1942) all performed experiments
similar to that of McKeachie and Hiler (1951).

Finally, Rothkopf (1963, 1965, 1966a, b, c, 1968), Rothkopf and
Bisbicos (1967), and Bruning (1968) have dealt essentially with
completion-type review questions inserted in the text itself; in this
respect the procedure begins to approximate programed instruction but
is still not the same because of the lack of "framing" and because of
the retention of traditional reading passage format. These
investigators have left out certain types of words (quantifiers,
adjectives, nouns, and so on) and tried to relate this to such things
as the degree of relatedness of such omissions to the text or to the
real test questions that follow the reading passage. The whole
advantage of Rothkopf's procedures is that one can gain a degree of
control, hitherto unavailable, in studying an effect such as content
structuring in test-like situations.